Preface


Exact Tests™ is a statistical package for analyzing continuous or categorical data by 
exact methods. The goal in Exact Tests is to enable you to make reliable inferences 
when your data are small, sparse, heavily tied, or unbalanced and the validity of the 
corresponding large sample theory is in doubt. This is achieved by computing exact p 
values for a very wide class of hypothesis tests, including one-, two-, and K-sample 
tests, tests for unordered and ordered categorical data, and tests for measures of 
association. The statistical methodology underlying these exact tests is well established in 
the statistical literature and may be regarded as a natural generalization of Fisher’s ex- 
act test for the single 2 x 2 contingency table. It is fully explained in this user manual. 
The real challenge has been to make this methodology operational through software 
development. Historically, this has been a difficult task because the computational de- 
mands imposed by the exact methods are rather severe. We and our colleagues at the 
Harvard School of Public Health have worked on these computational problems for 
over a decade and have developed exact and Monte Carlo algorithms to solve them. 
These algorithms have now been implemented in Exact Tests. For small data sets, the 
algorithms ensure quick computation of exact p values. If a data set is too large for the 
exact algorithms, Monte Carlo algorithms are substituted in their place in order to es- 
timate the exact p values to any desired level of accuracy. 

These numerical algorithms are fully integrated into the IBM® SPSS® Statistics 
system. Simple selections in the Nonparametric Tests and Crosstabs dialog boxes al- 
low you to obtain exact and Monte Carlo results quickly and easily. 
Getting Started


The Exact Tests option provides two new methods for calculating significance levels for 
the statistics available through the Crosstabs and Nonparametric Tests procedures. These 
new methods, the exact and Monte Carlo methods, provide a powerful means for obtain- 
ing accurate results when your data set is small, your tables are sparse or unbalanced, the 
data are not normally distributed, or the data fail to meet any of the underlying assump- 
tions necessary for reliable results using the standard asymptotic method. 


The Exact Method 


By default, IBM® SPSS® Statistics calculates significance levels for the statistics in the 
Crosstabs and Nonparametric Tests procedures using the asymptotic method. This 
means that p values are estimated based on the assumption that the data, given a suffi- 
ciently large sample size, conform to a particular distribution. However, when the data 
set is small, sparse, contains many ties, is unbalanced, or is poorly distributed, the asymp- 
totic method may fail to produce reliable results. In these situations, it is preferable to cal- 
culate a significance level based on the exact distribution of the test statistic. This enables 
you to obtain an accurate p value without relying on assumptions that may not be met by 
your data. 

The following example demonstrates the necessity of calculating the p value for 
small data sets. This example is discussed in detail in Chapter 2. 


© Copyright IBM Corporation. 1989, 2013 




Figure 1.1 shows results from an entrance examination for fire fighters in a small 
township. This data set compares the exam results based on the race of the applicant. 


Figure 1.1 Fire fighter entrance exam results

Test Results * Race of Applicant Crosstabulation

Count
                         Race of Applicant
Test Results    White    Black    Asian    Hispanic
Pass                5        2        2
No Show                      1                    1
Fail                         2        3           4

The data show that all five white applicants received a Pass result, whereas the results 
for the other groups are mixed. Based on this, you might want to test the hypothesis that 
exam results are not independent of race. To test this hypothesis, you can run the Pearson 
chi-square test of independence, which is available from the Crosstabs procedure. The 
results are shown in Figure 1.2. 


Figure 1.2 Pearson chi-square test results for fire fighter data

Chi-Square Tests

                        Value      df    Asymp. Sig. (2-tailed)
Pearson Chi-Square      11.556¹     6    .073

1. 12 cells (100.0%) have expected count less than 5. The minimum expected count is .50.


Because the observed significance of 0.073 is larger than 0.05, you might conclude that 
exam results are independent of race of examinee. However, notice that the data set contains 
only twenty observations, that the minimum expected frequency is 0.5, and that all 12 
of the cells have an expected frequency of less than 5. These are all indications that the 
assumptions necessary for the standard asymptotic calculation of the significance level 
for this test may not have been met. Therefore, you should obtain exact results. The 
exact results are shown in Figure 1.3. 


Figure 1.3 Exact results of Pearson chi-square test for fire fighter data

Chi-Square Tests

                        Value      df    Asymp. Sig. (2-tailed)    Exact Sig. (2-tailed)
Pearson Chi-Square      11.556¹     6    .073                      .040

1. 12 cells (100.0%) have expected count less than 5. The minimum expected count is .50.


The exact p value based on Pearson’s statistic is 0.040, compared to 0.073 for the as- 
ymptotic value. Using the exact p value, the null hypothesis would be rejected at the 
0.05 significance level, and you would conclude that there is evidence that the exam 
results and race of examinee are related. This is the opposite of the conclusion that 
would have been reached with the asymptotic approach. This demonstrates that when 
the assumptions of the asymptotic method cannot be met, the results can be unreliable. 
The exact calculation always produces a reliable result, regardless of the size, distribu- 
tion, sparseness, or balance of the data. 


The Monte Carlo Method 


Although exact results are always reliable, some data sets are too large for the exact p 
value to be calculated, yet don’t meet the assumptions necessary for the asymptotic 
method. In this situation, the Monte Carlo method provides an unbiased estimate of the 
exact p value, without the requirements of the asymptotic method. (See Table 1.1 and 
Table 1.2 for details.) The Monte Carlo method is a repeated sampling method. For any 
observed table, there are many tables, each with the same dimensions and column and 
row margins as the observed table. The Monte Carlo method repeatedly samples a specified 
number of these possible tables in order to obtain an unbiased estimate of the true 
p value. Figure 1.4 displays the Monte Carlo results for the fire fighter data. 


Figure 1.4 Monte Carlo results of the Pearson chi-square test for fire fighter data

Chi-Square Tests

                                                       Monte Carlo Significance (2-tailed)
                                                                 99% Confidence Interval
                        Value      df    Asymp. Sig.    Sig.²    Lower Bound    Upper Bound
                                         (2-tailed)
Pearson Chi-Square      11.556¹     6    .073           .041     .036           .046

1. 12 cells (100.0%) have expected count less than 5. The minimum expected count is .50.
2. Based on 10000 and seed 2000000 ...


The Monte Carlo estimate of the p value is 0.041. This estimate was based on 10,000 
samples. Recall that the exact p value was 0.040, while the asymptotic p value is 0.073. 
Notice that the Monte Carlo estimate is extremely close to the exact value. This demon- 
strates that if an exact p value cannot be calculated, the Monte Carlo method produces 
an unbiased estimate that is reliable, even in circumstances where the asymptotic p value 
is not. 
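
The sampling idea can be sketched in a few lines of Python. This is an illustrative sketch only, not the algorithm implemented in Exact Tests: it assumes that tables with the observed margins are drawn by randomly re-pairing the row and column labels of the 20 applicants, that the blank cells in Figure 1.1 are zero counts, and that a normal approximation gives the 99% confidence interval. The names `pearson_statistic` and `monte_carlo_p_value` are hypothetical, and Python's random number generator differs from the one the application uses, so the estimate will not reproduce Figure 1.4 digit for digit.

```python
import math
import random

# Observed counts from Figure 1.1 (assumption: blank cells are zero counts).
OBSERVED = [
    [5, 2, 2, 0],  # Pass
    [0, 1, 0, 1],  # No Show
    [0, 2, 3, 4],  # Fail
]


def pearson_statistic(table):
    """Pearson chi-square statistic for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (count - expected) ** 2 / expected
    return stat


def monte_carlo_p_value(table, samples=10_000, seed=2_000_000, z=2.576):
    """Estimate the exact p value by sampling tables with the observed margins.

    z = 2.576 is the normal quantile for a 99% confidence interval.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(table), len(table[0])
    # One row label and one column label per observation, in matching order.
    row_labels = [i for i, row in enumerate(table) for c in row for _ in range(c)]
    col_labels = [j for row in table for j, c in enumerate(row) for _ in range(c)]
    observed_stat = pearson_statistic(table)
    hits = 0
    for _ in range(samples):
        rng.shuffle(col_labels)  # re-pair the labels; both margins stay fixed
        sampled = [[0] * n_cols for _ in range(n_rows)]
        for i, j in zip(row_labels, col_labels):
            sampled[i][j] += 1
        # Small tolerance so that ties are not lost to floating-point rounding.
        if pearson_statistic(sampled) >= observed_stat - 1e-9:
            hits += 1
    p_hat = hits / samples
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / samples)
    return p_hat, max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


p_hat, lower, upper = monte_carlo_p_value(OBSERVED)
print(f"Monte Carlo p = {p_hat:.3f}, 99% CI = ({lower:.3f}, {upper:.3f})")
```

With 10,000 samples, the half-width of the interval is roughly 0.005 for a p value near 0.04, which is in line with the bounds reported in Figure 1.4.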




When to Use Exact Tests 


Calculating exact results can be computationally intensive and time-consuming, and can 
sometimes exceed the memory limits of your machine. In general, exact tests can be 
performed quickly with sample sizes of less than 30. Table 1.1 and Table 1.2 provide a 
guideline for the conditions under which exact results can be obtained quickly. In Table 
1.2, r indicates rows, and c indicates columns in a contingency table. 


Table 1.1 Sample sizes (N) at which the exact p values for nonparametric tests are computed
quickly

One-sample inference
  Chi-square goodness-of-fit test           N ≤ 30
  Binomial test and confidence interval     N ≤ 100,000
  Runs test                                 N ≤ 20
  One-sample Kolmogorov-Smirnov test        N ≤ 30

Two-related-sample inference
  Sign test                                 N ≤ 50
  Wilcoxon signed-rank test                 N ≤ 50
  McNemar test                              N ≤ 100,000
  Marginal homogeneity test                 N ≤ 50

Two-independent-sample inference
  Mann-Whitney test                         N ≤ 30
  Kolmogorov-Smirnov test                   N ≤ 30
  Wald-Wolfowitz runs test                  N ≤ 30

K-related-sample inference
  Friedman’s test                           N ≤ 30
  Kendall’s W                               N ≤ 30
  Cochran’s Q test                          N ≤ 30

K-independent-sample inference
  Median test                               N ≤ 50
  Kruskal-Wallis test                       N ≤ 15, K ≤ 4
  Jonckheere-Terpstra test                  N ≤ 20, K ≤ 4

Two-sample median test                      N ≤ 100,000
Table 1.2 Sample sizes (N) and table dimensions (r, c) at which the exact p values for
Crosstabs tests are computed quickly

2 x 2 contingency tables (obtained by selecting chi-square)
  Pearson chi-square test                   N ≤ 100,000
  Fisher’s exact test                       N ≤ 100,000
  Likelihood-ratio test                     N ≤ 100,000

r x c contingency tables (obtained by selecting chi-square)
  Pearson chi-square test                   N ≤ 30 and min{r, c} ≤ 3
  Fisher’s exact test                       N ≤ 30 and min{r, c} ≤ 3
  Likelihood-ratio test                     N ≤ 30 and min{r, c} ≤ 3
  Linear-by-linear association test         N ≤ 30 and min{r, c} ≤ 3

Correlations
  Pearson’s product-moment correlation coefficient
  Spearman’s rank-order correlation coefficient

Ordinal data
  Kendall’s tau-b                           N ≤ 20 and r ≤ 3
  Kendall’s tau-c                           N ≤ 20 and r ≤ 3
  Somers’ d                                 N ≤ 30
  Gamma                                     N ≤ 20 and r ≤ 3

Nominal data
  Contingency coefficients                  N ≤ 30 and min{r, c} ≤ 3
  Phi and Cramér’s V                        N ≤ 30 and min{r, c} ≤ 3
  Goodman and Kruskal’s tau                 N ≤ 20 and r ≤ 3
  Uncertainty coefficient                   N ≤ 30 and min{r, c} ≤ 3

Kappa                                       N ≤ 30 and c ≤ 5


How to Obtain Exact Statistics 


The exact and Monte Carlo methods are available for Crosstabs and all of the Nonparametric 
tests. 

To obtain exact statistics, open the Crosstabs dialog box or any of the Nonparametric 
Tests dialog boxes. The Crosstabs and Tests for Several Independent Samples dialog 
boxes are shown in Figure 1.5. 


Figure 1.5 Crosstabs and Nonparametric Tests dialog boxes

[Screenshot: the Crosstabs dialog box (Row(s), Column(s), and Layer lists; Suppress tables; Statistics and Cells buttons) and the Tests for Several Independent Samples dialog box (Test Variable List; Grouping Variable with Define Range; Test Type options Kruskal-Wallis H, Median, and Jonckheere-Terpstra). A callout in the figure reads "Click here for exact tests."]


• Select the statistics that you want to calculate. To select statistics in the Crosstabs 
dialog box, click Statistics. 

• To select the exact or Monte Carlo method for computing the significance level of 
the selected statistics, click Exact in the Crosstabs or Nonparametric Tests dialog box. 
This opens the Exact Tests dialog box, as shown in Figure 1.6. 

Figure 1.6 Exact Tests dialog box

[Screenshot: the Exact Tests dialog box, offering three methods: Asymptotic only; Monte Carlo, with Confidence level and Number of samples fields; and Exact, with a Time limit per test field in minutes. A note in the dialog box reads "Exact method will be used instead of Monte Carlo when computational limits allow."]


You can choose one of the following methods for computing statistics. The method you 
choose will be used for all selected statistics. 


Asymptotic only. Calculates significance levels using the asymptotic method. This pro- 
vides the same results that would be provided without the Exact Tests option. 


Monte Carlo. Provides an unbiased estimate of the exact p value and displays a confi- 
dence interval using the Monte Carlo sampling method. Asymptotic results are also dis- 
played. The Monte Carlo method is less computationally intensive than the exact 
method, so results can often be obtained more quickly. However, if you have chosen the 
Monte Carlo method, but exact results can be calculated quickly for your data, they will 
be provided. See Appendix A for details on the circumstances under which exact, rather 
than Monte Carlo, results are provided. Note that, within a session, the Monte Carlo 
method relies on a random number seed that changes each time you run the procedure. 
If you want to duplicate your results, you should set the random number seed every time 
you use the Monte Carlo method. See “How to Set the Random Number Seed” on p. 9 
for more information. 


Confidence level. Specify a confidence level between 0.01 and 99.9. The default value 
is 99. 


Number of samples. Specify a number between 1 and 1,000,000,000 for the number 
of samples used in calculating the Monte Carlo approximation. The default is 10,000. 
Larger numbers of samples produce more reliable estimates of the exact p value but 
also take longer to calculate. 




Exact. Calculates the exact p value. Asymptotic results are also displayed. Because com- 
puting exact statistics can be time-consuming, you can set a limit on the amount of time 
allowed for each test. 


Time limit per test. Enter the maximum time allowed for calculating each test. The 
time limit can be between 1 and 9,999,999 minutes. The default is five minutes. If the 
time limit is reached, the test is terminated, no exact results are provided, and the 
application proceeds to the next test in the analysis. If a test exceeds a set time limit of 
30 minutes, it is recommended that you use the Monte Carlo, rather than the exact, 
method. 


Calculating the exact p value can be memory-intensive. If you have selected the exact 
method and find that you have insufficient memory to calculate results, you should first 
close any other applications that are currently running in order to make more memory 
available. If you still cannot obtain exact results, use the Monte Carlo method. 


Additional Features Available with Command Syntax 


Command syntax allows you to: 

• Exceed the upper time limit available through the dialog box. 
• Exceed the maximum number of samples available through the dialog box. 
• Specify values for the confidence interval with greater precision. 


Nonparametric Tests 


As of release 6.1, two new nonparametric tests became available, the Jonckheere- 
Terpstra test and the marginal homogeneity test. The Jonckheere-Terpstra test can be 
obtained from the Tests for Several Independent Samples dialog box, and the mar- 
ginal homogeneity test can be obtained from the Two-Related-Samples Tests dialog 
box. 


How to Set the Random Number Seed 


Monte Carlo computations use the pseudo-random number generator, which begins with 
a seed, a very large integer value. Within a session, the application uses a different seed 
each time you generate a set of random numbers, producing different results. If you want 
to duplicate your results, you can reset the seed value. Monte Carlo output always 
displays the seed used in that analysis, so that you can reset the seed to that value if you 
want to repeat an analysis. To reset the seed, open the Random Number Seed dialog box 
from the Transform menu. The Random Number Seed dialog box is shown in Figure 1.7. 


Figure 1.7 Random Number Seed dialog box

[Screenshot: the Random Number Seed dialog box, showing a seed value of 2000000 and a Random Seed option.]

Set seed to. Specify any positive integer value up to 999,999,999 as the seed value. The 
seed is reset to the specified value each time you open the dialog box and click OK. 
The default seed value is 2,000,000. 


To duplicate the same series of random numbers, you should set the seed before you gen- 
erate the series for the first time. 


Random seed. Sets the seed to a random value chosen by your system. 
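
The effect of resetting a seed can be illustrated outside the application. The following Python sketch is an analogy only: it uses Python's own pseudo-random number generator, not the generator in IBM SPSS Statistics, and the helper name `draw_series` is hypothetical. It shows that restarting from the same seed reproduces the same series, while a different seed does not.

```python
import random


def draw_series(seed, length=5):
    """Return `length` pseudo-random numbers from a generator started at `seed`."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(length)]


first = draw_series(2_000_000)
repeat = draw_series(2_000_000)   # same seed: the series is duplicated
other = draw_series(2_000_001)    # different seed: a different series

print("same seed reproduces the series:", first == repeat)
print("different seed reproduces the series:", first == other)
```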


Pivot Table Output 


With this release of Exact Tests, output appears in pivot tables. Many of the tables shown 
in this manual have been edited by pivoting them, by hiding categories that are not 
relevant to the current discussion, and by showing more decimal places than appear by default. 


Exact Tests 


A fundamental problem in statistical inference is summarizing observed data in terms 
of a p value. The p value forms part of the theory of hypothesis testing and may be 
regarded as an index for judging whether to accept or reject the null hypothesis. A very 
small p value is indicative of evidence against the null hypothesis, while a large p value 
implies that the observed data are compatible with the null hypothesis. There is a long 
tradition of using the value 0.05 as the cutoff for rejection or acceptance of the null 
hypothesis. While this may appear arbitrary in some contexts, its almost universal 
adoption for testing scientific hypotheses has the merit of limiting the number of false- 
positive conclusions to at most 5%. At any rate, no matter what cutoff you choose, the 
p value provides an important objective input for judging if the observed data are 
statistically significant. Therefore, it is crucial that this number be computed 
accurately. 

Since data may be gathered under diverse, often nonverifiable, conditions, it is 
desirable, for p value calculations, to make as few assumptions as possible about the 
underlying data generation process. In particular, it is best to avoid making 
assumptions about the distribution, such as that the data came from a normal 
distribution. This goal has spawned an entire field of statistics known as nonparametric 
statistics. In the preface to his book, Nonparametrics: Statistical Methods Based on 
Ranks, Lehmann (1975) traces the earliest development of a nonparametric test to 
Arbuthnot (1710), who came up with the remarkably simple, yet popular, sign test. In 
this century, nonparametric methods received a major impetus from a seminal paper by 
Frank Wilcoxon (1945) in which he developed the now universally adopted Wilcoxon 
signed-rank test and the Wilcoxon rank-sum test. Other important early research in the 
field of nonparametric methods was carried out by Friedman (1937), Kendall (1938), 
Smirnov (1939), Wald and Wolfowitz (1940), Pitman (1948), Kruskal and Wallis 
(1952), and Chernoff and Savage (1958). One of the earliest textbooks on 
nonparametric statistics in the behavioral and social sciences was Siegel (1956). 

The early research, and the numerous papers, monographs and textbooks that 
followed in its wake, dealt primarily with hypothesis tests involving continuous 
distributions. The data usually consisted of several independent samples of real 
numbers (possibly containing ties) drawn from different populations, with the 
objective of making distribution-free one-, two-, or K-sample comparisons, performing 
goodness-of-fit tests, and computing measures of association. Much earlier, Karl 
Pearson (1900) demonstrated that the large-sample distribution of a test statistic, based 
on the difference between the observed and expected counts of categorical data 
generated from multinomial, hypergeometric, or Poisson distributions is chi-square. 
This work was found to be applicable to a whole class of discrete data problems. It was 
followed by significant contributions by, among others, Yule (1912), R. A. Fisher 
(1925, 1935), Yates (1984), Cochran (1936, 1954), Kendall and Stuart (1979), and 
Goodman (1968) and eventually evolved into the field of categorical data analysis. An 
excellent up-to-date textbook dealing with this rapidly growing field is Agresti (1990). 

The techniques of nonparametric and categorical data inference are popular mainly 
because they make only minimal assumptions about how the data were generated— 
assumptions such as independent sampling or randomized treatment assignment. For 
continuous data, you do not have to know the underlying distribution giving rise to the 
data. For categorical data, mathematical models like the multinomial, Poisson, or 
hypergeometric model arise naturally from the independence assumptions of the sampled 
observations. Nevertheless, for both the continuous and categorical cases, these methods 
do require one assumption that is sometimes hard to verify. They assume that the data set 
is large enough for the test statistic to converge to an appropriate limiting normal or chi- 
square distribution. P values are then obtained by evaluating the tail area of the limiting 
distribution, instead of actually deriving the true distribution of the test statistic and then 
evaluating its tail area. P values based on the large-sample assumption are known as 
asymptotic p values, while p values based on deriving the true distribution of the test 
statistic are termed exact p values. While exact p values are preferred for scientific 
inference, they often pose formidable computational problems and so, as a practical 
matter, asymptotic p values are used in their place. For large and well-balanced data sets, 
this makes very little difference, since the exact and asymptotic p values are very similar. 
But for small, sparse, unbalanced, and heavily tied data, the exact and asymptotic p values 
can be quite different and may lead to opposite conclusions concerning the hypothesis of 
interest. This was a major concern of R. A. Fisher, who stated in the preface to the first 
edition of Statistical Methods for Research Workers (1925): 


The traditional machinery of statistical processes is wholly unsuited to the needs of 
practical research. Not only does it take a cannon to shoot a sparrow, but it misses the 
sparrow! The elaborate mechanism built on the theory of infinitely large samples is not 
accurate enough for simple laboratory data. Only by systematically tackling small 
problems on their merits does it seem possible to apply accurate tests to practical data. 

The example of a sparse 3 x 9 contingency table, shown in Figure 2.1, demonstrates 
that Fisher’s concern was justified. 


Figure 2.1 Sparse 3 x 9 contingency table

VAR1 * VAR2 Crosstabulation

Count
                      VAR2
VAR1    1    2    3    4    5    6    7    8    9
1            7                             1    1
2       1    1    1    1    1    1    1
3            8


The Pearson chi-square test is commonly used to test for row and column independence. 
For the above table, the results are shown in Figure 2.2. 


Figure 2.2 Pearson chi-square test results for sparse 3 x 9 table

Chi-Square Tests

                        Value      df    Asymp. Sig. (2-tailed)
Pearson Chi-Square      22.286¹    16    .134

1. 25 cells (92.6%) have expected count less than 5. The minimum expected count is .29.


The observed value of Pearson’s statistic is X^2 = 22.29, and the asymptotic p value 
is the tail area to the right of 22.29 from a chi-square distribution with 16 degrees of 
freedom. This p value is 0.134, implying that it is reasonable to assume row and column 
independence. With Exact Tests, you can also compute the tail area to the right of 22.29 
from the exact distribution of Pearson’s statistic. The exact results are shown in Figure 
2.3. 


Figure 2.3 Exact results of Pearson chi-square test for sparse 3 x 9 table

Chi-Square Tests

                        Value      df    Asymp. Sig. (2-tailed)    Exact Sig. (2-tailed)
Pearson Chi-Square      22.286¹    16    .134                      .001

1. 25 cells (92.6%) have expected count less than 5. The minimum expected count is .29.

The exact p value obtained above is 0.001, implying that there is a strong row and 
column interaction. Chapter 9 discusses this and related tests in detail. 

The above example highlights the need to compute the exact p value, rather than 
relying on asymptotic results, whenever the data set is small, sparse, unbalanced, or 
heavily tied. The trouble is that it is difficult to identify, a priori, that a given data set 
suffers from these obstacles to asymptotic inference. Bishop, Fienberg, and Holland 
(1975) express the predicament in the following way: 


The difficulty of exact calculations coupled with the availability of normal approxi- 
mations leads to the almost automatic computation of asymptotic distributions and 
moments for discrete random variables. Three questions may be asked by a potential 
user of these asymptotic calculations: 


1. How does one make them? What are the formulas and techniques for getting the 
answers? 


2. How does one justify them? What conditions are needed to ensure that these for- 
mulas and techniques actually produce valid asymptotic results? 


3. How does one relate asymptotic results to pre-asymptotic situations? How close 
are the answers given by an asymptotic formula to the actual cases of interest 
involving finite samples? 


These questions differ vastly in the ease with which they may be answered. The 
answer to (1) usually requires mathematics at the level of elementary calculus. 
Question (2) is rarely answered carefully, and is typically tossed aside by a remark of 
the form ‘...assuming that higher order terms may be ignored...’ Rigorous answers to 
question (2) require some of the deepest results in mathematical probability theory. 
Question (3) is the most important, the most difficult, and consequently the least 
answered. Analytic answers to question (3) are usually very difficult, and it is more 
common to see reported the result of a simulation or a few isolated numerical 
calculations rather than an exhaustive answer. 


The concerns expressed by R. A. Fisher and by Bishop, Fienberg, and Holland can be 
resolved if you directly compute exact p values instead of replacing them with their 
asymptotic versions and hoping that these will be accurate. Fisher himself suggested the 
use of exact p values for 2 x 2 tables (1925) as well as for data from randomized 
experiments (1935). Exact Tests computes an exact p value for practically every 
important nonparametric test on either continuous or categorical data. This is achieved 
by permuting the observed data in all possible ways and comparing what was actually 
observed to what might have been observed. Thus exact p values are also known as 
permutational p values. The following two sections illustrate through concrete examples 
how the permutational p values are computed. 

Pearson Chi-Square Test for a 3 x 4 Table 


Figure 2.4 shows results from an entrance examination for fire fighters in a small township. 


Figure 2.4 Fire fighter entrance exam results

Test Results * Race of Applicant Crosstabulation

Count
                         Race of Applicant
Test Results    White    Black    Asian    Hispanic
Pass                5        2        2
No Show                      1                    1
Fail                         2        3           4


The table shows that all five white applicants received a Pass result, whereas the results 
for the other groups are mixed. Is this evidence that entrance exam results are related to 
race? Note that while there is some evidence of a pattern, the total number of observa- 
tions is only twenty. Null and alternative hypotheses might be formulated for these data 
as follows: 


Null Hypothesis: Exam results and race of examinee are independent. 
Alternative Hypothesis: Exam results and race of examinee are not independent. 


To test the hypothesis of independence, use the Pearson chi-square test of independence, 
available in the Crosstabs procedure. To get the results shown in Figure 2.5, the test was 
conducted at the 0.05 significance level: 


Figure 2.5 Pearson chi-square test results for fire fighter data

Chi-Square Tests

                        Value      df    Asymp. Sig. (2-tailed)
Pearson Chi-Square      11.556¹     6    .073

1. 12 cells (100.0%) have expected count less than 5. The minimum expected count is .50.


Because the observed significance of 0.073 is larger than 0.05, you might conclude that 
the exam results are independent of the race of the examinee. However, notice that the table 
reports that the minimum expected frequency is 0.5, and that all 12 of the cells have an 
expected frequency that is less than five. 

That is, the application warns you that all of the cells in the table have small expected 
counts. What does this mean? Does it matter? 




Recall that the Pearson chi-square statistic, X^2, is computed from the observed and 
the expected counts under the null hypothesis of independence as follows: 

X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(x_{ij} - \hat{x}_{ij})^2}{\hat{x}_{ij}}        Equation 2.1

where x_{ij} is the observed count, and 

\hat{x}_{ij} = (m_i n_j) / N        Equation 2.2

is the expected count in cell (i, j) of an r x c contingency table whose row margins are 
(m_1, m_2, ..., m_r), column margins are (n_1, n_2, ..., n_c), and total sample size is N. 
Statistical theory shows that, under the null hypothesis, the random variable X^2 
asymptotically follows the theoretical chi-square distribution with (r - 1) x (c - 1) 
degrees of freedom. Therefore, the asymptotic p value is 

Pr(\chi^2 \geq 11.55556) = 0.07265        Equation 2.3

where \chi^2 is a random variable following a chi-square distribution with 6 degrees of 
freedom. 
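
As a check on Equations 2.1 through 2.3, the following Python sketch recomputes the statistic and its asymptotic p value for the fire fighter data. It is an illustration under stated assumptions rather than the application's own computation: the blank cells of Figure 2.4 are taken to be zero counts, and the chi-square tail area uses the closed form that holds only for an even number of degrees of freedom, Pr(\chi^2 \geq x) = e^{-x/2} \sum_{k=0}^{df/2 - 1} (x/2)^k / k!. The function names are hypothetical.

```python
import math

# Observed counts from Figure 2.4 (assumption: blank cells are zero counts).
OBSERVED = [
    [5, 2, 2, 0],  # Pass
    [0, 1, 0, 1],  # No Show
    [0, 2, 3, 4],  # Fail
]


def expected_counts(table):
    """Equation 2.2: expected count m_i * n_j / N for every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[m * n_j / n for n_j in col_totals] for m in row_totals]


def pearson_statistic(table):
    """Equation 2.1: sum over cells of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum(
        (x - e) ** 2 / e
        for row, expected_row in zip(table, expected)
        for x, e in zip(row, expected_row)
    )


def chi_square_upper_tail(x, df):
    """Upper tail area of the chi-square distribution; valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))


statistic = pearson_statistic(OBSERVED)
df = (len(OBSERVED) - 1) * (len(OBSERVED[0]) - 1)
asymptotic_p = chi_square_upper_tail(statistic, df)
print(f"X^2 = {statistic:.5f}, df = {df}, asymptotic p = {asymptotic_p:.5f}")
```

For odd degrees of freedom, a general-purpose routine for the chi-square distribution would be needed in place of `chi_square_upper_tail`.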


The term asymptotically means “given a sufficient sample size,” though it is not easy 
to describe the sample size needed for the chi-square distribution to approximate the 
exact distribution of the Pearson statistic. 


One rule of thumb is: 

• The minimum expected cell count for all cells should be at least 5 (Cochran, 1954). 
The problem with this rule is that it can be unnecessarily conservative. 

Another rule of thumb is: 

• For tables larger than 2 x 2, a minimum expected count of 1 is permissible as long as 
no more than about 20% of the cells have expected values below 5 (Cochran, 1954). 


While these and other rules have been proposed and studied, no simple rule covers all 
cases. (See Agresti, 1990, for further discussion.) In our case, considering sample size, 
number of cells relative to sample size, and small expected counts, it appears that relying 
on an asymptotic result to compute a p value might be problematic. 
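
The two rules of thumb can be applied mechanically to the fire fighter table. The Python sketch below is illustrative only: it assumes the blank cells of Figure 2.4 are zero counts, it treats "about 20%" as a hard 20% cutoff, and the function names are hypothetical.

```python
# Observed counts from Figure 2.4 (assumption: blank cells are zero counts).
OBSERVED = [
    [5, 2, 2, 0],  # Pass
    [0, 1, 0, 1],  # No Show
    [0, 2, 3, 4],  # Fail
]


def expected_counts(table):
    """Expected counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[m * n_j / n for n_j in col_totals] for m in row_totals]


def cochran_rules(table):
    """Evaluate both rules of thumb on the expected counts of `table`."""
    cells = [e for row in expected_counts(table) for e in row]
    share_below_5 = sum(1 for e in cells if e < 5) / len(cells)
    return {
        "min_expected": min(cells),
        "share_below_5": share_below_5,
        # Rule 1: every expected count is at least 5.
        "rule_1_met": min(cells) >= 5,
        # Rule 2 (tables larger than 2 x 2): minimum of 1, at most 20% below 5.
        "rule_2_met": min(cells) >= 1 and share_below_5 <= 0.20,
    }


result = cochran_rules(OBSERVED)
print(result)
```

Neither rule is met for these data, which agrees with the footnote in Figure 2.5.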

What if, instead of relying on the distribution of \chi^2, it were possible to use the true 
sampling distribution of X^2 and thereby produce an exact p value? Using Exact Tests, 
you can do that. The following discussion explains how this p value is computed, and 
why it is exact. For technical details, see Chapter 9. Consider the observed 3 x 4 
crosstabulation (see Figure 2.4) relative to a reference set of other 3 x 4 tables that are 
like it in every possible respect, except in terms of their reasonableness under the null 
hypothesis. It is generally accepted that this reference set consists of all 3 x 4 tables of 
the form shown below and having the same row and column margins as Figure 2.4 (see, 
for example, Fisher, 1973, Yates, 1984, Little, 1989, and Agresti, 1992). 


x_11    x_12    x_13    x_14     9
x_21    x_22    x_23    x_24     2
x_31    x_32    x_33    x_34     9

   5       5       5       5    20


This is a reasonable choice for a reference set, even when these margins are not naturally 
fixed in the original data set, because they do not contain any information about the null 
hypothesis being tested. The exact p value is then obtained by identifying all of the 
tables in this reference set for which Pearson’s statistic equals or exceeds 11.55556, the 
observed statistic, and summing their probabilities. This is an exact p value because the 
probability of any table, {x_{ij}}, in the above reference set of tables with fixed margins 
can be computed exactly under the null hypothesis. It can be shown to be the 
hypergeometric probability 

P(\{x_{ij}\}) = \frac{\prod_{j=1}^{c} n_j! \prod_{i=1}^{r} m_i!}{N! \prod_{j=1}^{c} \prod_{i=1}^{r} x_{ij}!}        Equation 2.4

For example, the table 
5 2 2 0 9 
0 0 0 2 2 
0 3 3 3 9 
5 5 5 5 20 


is a member of the reference set. Applying Equation 2.1 to this table yields a value of 
X^2 = 14.67 for Pearson’s statistic. Since this value is greater than the value 
X^2 = 11.55556, this member of the reference set is regarded as more extreme than 
Figure 2.4. Its exact probability, calculated by Equation 2.4, is 0.000108, and will 
contribute to the exact p value. The following table 


4 3 2 0 9 
1 0 0 1 2 
0 2 3 4 9 
5 5 5 5 20


is another member of the reference set. You can easily verify that its Pearson statistic is
X² = 9.778, which is less than 11.55556. Therefore, this table is regarded as less
extreme than the observed table and does not count towards the p value. In principle,
you can repeat this analysis for every single table in the reference set, identify all those
that are at least as extreme as the original table, and sum their exact hypergeometric
probabilities. The exact p value is this sum.
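
The enumeration described above is small enough to carry out directly for this reference set. The following Python sketch is illustrative only and is not the algorithm used by Exact Tests; the function names are hypothetical, and the small tolerance applied to the reported statistic 11.55556 is an assumption made so that tables tied with the observed one are counted. It evaluates the two example tables and then sums Equation 2.4 over every table that is at least as extreme as the observed one.

```python
from itertools import product
from math import factorial, prod

ROW_MARGINS = (9, 2, 9)       # m_i, row totals of Figure 2.4
COL_MARGINS = (5, 5, 5, 5)    # n_j, column totals of Figure 2.4
N = sum(ROW_MARGINS)

def pearson(table):
    """Pearson chi-square statistic of a table given as a tuple of rows."""
    return sum(
        (x - m * n / N) ** 2 / (m * n / N)
        for row, m in zip(table, ROW_MARGINS)
        for x, n in zip(row, COL_MARGINS)
    )

def hypergeometric_prob(table):
    """Probability of a table with fixed margins under the null (Equation 2.4)."""
    numerator = prod(map(factorial, ROW_MARGINS)) * prod(map(factorial, COL_MARGINS))
    denominator = factorial(N) * prod(factorial(x) for row in table for x in row)
    return numerator / denominator

def reference_set():
    """Yield every 3 x 4 table of non-negative counts with the fixed margins."""
    column_choices = [
        [(a, b, n - a - b) for a in range(n + 1) for b in range(n - a + 1)]
        for n in COL_MARGINS
    ]
    for columns in product(*column_choices):
        rows = tuple(zip(*columns))              # transpose columns into rows
        if tuple(sum(r) for r in rows) == ROW_MARGINS:
            yield rows

def exact_p_value(observed_stat, tol=1e-4):
    """Sum the probabilities of all tables at least as extreme as observed.

    tol absorbs the rounding in the reported statistic (an assumption).
    """
    return sum(
        hypergeometric_prob(t)
        for t in reference_set()
        if pearson(t) >= observed_stat - tol
    )

more_extreme = ((5, 2, 2, 0), (0, 0, 0, 2), (0, 3, 3, 3))
less_extreme = ((4, 3, 2, 0), (1, 0, 0, 1), (0, 2, 3, 4))
p_exact = exact_p_value(11.55556)

print(round(pearson(more_extreme), 2), round(hypergeometric_prob(more_extreme), 6))
print(round(pearson(less_extreme), 3))
print(round(p_exact, 4))
```

Brute-force enumeration of this kind is practical only for very small tables; the number of tables in the reference set grows rapidly with the sample size and the table dimensions.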

Exact Tests produces the following result: 


$$\Pr(X^2 \geq 11.55556) = 0.0398$$          Equation 2.5


The exact results are shown in Figure 2.6. 


Figure 2.6 Exact results of the Pearson chi-square test for fire fighter data 


Chi-Square Tests 


                                     Asymp. Sig.   Exact Sig.
                     Value      df   (2-tailed)    (2-tailed)
Pearson Chi-Square   11.556 (1)  6      .073          .040


1. 12 cells (100.0%) have expected count less than 5. The minimum expected count is .50.


The exact p value based on Pearson’s statistic is 0.040. At the 0.05 level of significance, 
the null hypothesis would be rejected and you would conclude that there is evidence that 
the exam results and race of examinee are related. This conclusion is the opposite of the 
conclusion that would be reached with the asymptotic approach, since the latter 
produced a p value of 0.073. The asymptotic p value is only an approximate estimate of 
the exact p value. Kendall and Stuart (1979) have proved that as the sample size goes 
to infinity, the exact p value (see Equation 2.5) converges to the chi-square based p value 
(see Equation 2.3). Of course, the sample size for the current data set is not infinite, and 
you can observe that this asymptotic result has fared rather poorly. 


Fisher’s Exact Test for a 2 x 2 Table 


It could be said that Sir R. A. Fisher was the father of exact tests. He developed what is 
popularly known as Fisher’s exact test for a single 2x2 contingency table. His 
motivating example was as follows (see Agresti, 1990, for a related discussion). When 
drinking tea, a British woman claimed to be able to distinguish whether milk or tea was 
added to the cup first. In order to test this claim, she was given eight cups of tea. In four 
of the cups, tea was added first, and in four of the cups, milk was added first. The order 
in which the cups were presented to her was randomized. She was told that there were 
four cups of each type, so that she should make four predictions of each order. The 
results of the experiment are shown in Figure 2.7.


Figure 2.7 Fisher’s tea-tasting experiment


GUESS * POUR Crosstabulation


                                     POUR
                                 Milk    Tea    Total
GUESS   Milk   Count               3      1       4
               Expected Count     2.0    2.0     4.0
        Tea    Count               1      3       4
               Expected Count     2.0    2.0     4.0
Total          Count               4      4       8
               Expected Count     4.0    4.0     8.0


Given the woman’s performance in the experiment, can you conclude that she could 
distinguish whether milk or tea was added to the cup first? Figure 2.7 shows that she 
guessed correctly more times than not, but on the other hand, the total number of trials 
was not very large, and she might have guessed correctly by chance alone. Null and 
alternative hypotheses can be formulated as follows: 


Null Hypothesis: The order in which milk or tea is poured into a cup and the taster’s guess 
of the order are independent. 


Alternative Hypothesis: The taster can correctly guess the order in which milk or tea is 
poured into a cup. 


Note that the alternative hypothesis is one-sided. That is, although there are two 
possibilities—that the woman guesses better than average or she guesses worse than 
average—we are only interested in detecting the alternative that she guesses better than 
average.

The Pearson chi-square test of independence can be calculated to test this hypothesis.
This example tests the alternative hypothesis at the 0.05 significance level. Results are
shown in Figure 2.8.


Figure 2.8 Pearson chi-square test results for tea-tasting experiment 


Chi-Square Tests 


                                     Asymp. Sig.
                     Value      df   (2-tailed)
Pearson Chi-Square   2.000 (2)   1      .157


2. 4 cells (100.0%) have expected count less than 5. The minimum expected count is 2.00.


The reported significance, 0.157, is two-sided. Because the alternative hypothesis is 
one-sided, you might halve the reported significance, thereby obtaining 0.079 as the 
observed p value. Because the observed p value is greater than 0.05, you might conclude 
that there is no evidence that the woman can correctly guess tea-milk order, although the 
observed level of 0.079 is only marginally larger than the 0.05 level of significance used 
for the test. 
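
The asymptotic figures quoted above can be checked with a few lines of Python. This is a minimal sketch rather than the Crosstabs implementation; the variable names are illustrative, and it relies on the standard identity that the upper tail of a chi-square distribution with one degree of freedom is erfc(sqrt(x / 2)).

```python
from math import erfc, sqrt

# Observed counts from Figure 2.7 (rows: GUESS, columns: POUR)
observed = [[3, 1], [1, 3]]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected
chi2 = sum(
    (observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(2)
    for j in range(2)
)

# Upper-tail probability of a chi-square variate with 1 degree of freedom
p_two_sided = erfc(sqrt(chi2 / 2))
p_one_sided = p_two_sided / 2      # halved for the one-sided alternative

print(chi2, round(p_two_sided, 3), round(p_one_sided, 3))
```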

It is easy to see from inspection of Figure 2.7 that the expected cell count under the 
null hypothesis of independence is 2 for every cell. Given the popular rules of thumb 
about expected cell counts cited above, this raises concern about use of the one-degree- 
of-freedom chi-square distribution as an approximation to the distribution of the Pearson 
chi-square statistic for the above table. Rather than rely on an approximation that has an 
asymptotic justification, suppose you can instead use an exact approach. 

For the 2 x 2 table, Fisher noted that under the null hypothesis of independence, if 
you assume fixed marginal frequencies for both the row and column marginals, then the 
hypergeometric distribution characterizes the distribution of the four cell counts in the 
2x2 table. This fact enables you to calculate an exact p value rather than rely on an 
asymptotic justification. 

Let the generic four-fold table, {x_ij}, take the form


x11   x12  |  m1
x21   x22  |  m2
n1    n2   |  N


with (x11, x12, x21, x22) being the four cell counts; m1 and m2, the row totals; n1 and
n2, the column totals; and N, the table total. If you assume the marginal totals as given,
the value of x11 determines the other three cell counts. Assuming fixed marginals, the
distribution of the four cell counts follows the hypergeometric distribution, stated here
in terms of x11:


$$\Pr(\{x_{ij}\}) = \frac{m_1! \, m_2! \, n_1! \, n_2!}{N! \, x_{11}! \, x_{12}! \, x_{21}! \, x_{22}!}$$          Equation 2.6


The p value for Fisher’s exact test of independence in the 2 x2 table is the sum of 
hypergeometric probabilities for outcomes at least as favorable to the alternative 
hypothesis as the observed outcome. 

Let’s apply this line of thought to the tea drinking problem. In this example, the 
experimental design itself fixes both marginal distributions, since the woman was asked 
to guess which four cups had the milk added first and therefore which four cups had the 
tea added first. So, the table has the following general form: 


                    Pour
Guess          Milk     Tea     Row Total
Milk           x11      x12         4
Tea            x21      x22         4
Col_Total       4        4          8


Focusing on x11, this cell count can take the values 0, 1, 2, 3, or 4, and designating a
value for x11 determines the other three cell values, given that the marginals are fixed.
In other words, assuming fixed marginals, you could observe the following tables with 
the indicated probabilities: 


             Table             Pr(Table)    p value

x11 = 0      0   4  |  4         0.014       1.000
             4   0  |  4
             4   4  |  8

x11 = 1      1   3  |  4         0.229       0.986
             3   1  |  4
             4   4  |  8

x11 = 2      2   2  |  4         0.514       0.757
             2   2  |  4
             4   4  |  8

x11 = 3      3   1  |  4         0.229       0.243
             1   3  |  4
             4   4  |  8

x11 = 4      4   0  |  4         0.014       0.014
             0   4  |  4
             4   4  |  8


Figure 2.9 Exact results of the Pearson chi-square test for tea-tasting experiment 


Chi-Square Tests


                     Asymp. Sig.   Exact Sig.   Exact Sig.
                     (2-tailed)    (2-tailed)   (1-tailed)
Pearson Chi-Square   .157 (2)        .486         .243


2. 4 cells (100.0%) have expected count less than 5. The minimum expected count is 2.00.


The probability of each possible table in the reference set of 2x2 tables with the 
observed margins is obtained from the hypergeometric distribution formula shown in 
Equation 2.6. The p values shown above are the sums of probabilities for all outcomes 
at least as favorable (in terms of guessing correctly) as the one in question. For example, 
since the table actually observed has x11 = 3, the exact p value is the sum of
probabilities of all of the tables for which x11 equals or exceeds 3. The exact results are
shown in Figure 2.9. 


The exact result works out to 0.229 + 0.014 = 0.243. Given such a relatively large p 
value, you would conclude that the woman’s performance does not furnish sufficient 
evidence that she can correctly guess milk-tea pouring order. Note that the asymptotic p 
value for the Pearson chi-square test of independence was 0.079, a dramatically different 
number. The exact test result leads to the same conclusion as the asymptotic test result, 
but the exact p value is very different from 0.05, while the asymptotic p value is only 
marginally larger than 0.05. In this example, all 4 margins of the 2 x 2 table were fixed
by design. For the example in “Pearson Chi-Square Test for a 3 x 4 Table” on p. 15, the
margins were not fixed. Nevertheless, for both examples, the reference set was
constructed from fixed row and column margins. Whether or not the margins of the
observed contingency table are naturally fixed is irrelevant to the method used to
compute the exact test. In either case, you compute an exact p value by examining the
observed table in relation to all other tables in a reference set of contingency tables 
whose margins are the same as those of the actually observed table. You will see that the 
idea behind this relatively simple example generalizes to include all of the 
nonparametric and categorical data settings covered by Exact Tests. 
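
The tea-tasting probabilities and p values listed above follow directly from Equation 2.6. The Python sketch below is illustrative only (the function and variable names are hypothetical); it evaluates the hypergeometric probability for each possible value of x11 and accumulates the one-sided p values.

```python
from math import factorial

M1, M2 = 4, 4        # row totals (guesses of Milk, Tea)
N1, N2 = 4, 4        # column totals (cups poured Milk first, Tea first)
N = M1 + M2

def table_prob(x11):
    """Hypergeometric probability of the 2 x 2 table determined by x11 (Equation 2.6)."""
    x12 = M1 - x11
    x21 = N1 - x11
    x22 = M2 - x21
    numerator = factorial(M1) * factorial(M2) * factorial(N1) * factorial(N2)
    denominator = (factorial(N) * factorial(x11) * factorial(x12)
                   * factorial(x21) * factorial(x22))
    return numerator / denominator

probs = {x: table_prob(x) for x in range(5)}

# One-sided p value: probability of a result at least as favorable as x11
p_values = {x: sum(probs[k] for k in range(x, 5)) for x in range(5)}

for x in range(5):
    print(x, round(probs[x], 3), round(p_values[x], 3))
```

For the observed table, x11 = 3, the loop reproduces the sum 0.229 + 0.014 = 0.243 given in the text.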


Choosing between Exact, Monte Carlo, and Asymptotic P Values 


The above examples illustrate that in order to compute an exact p value, you must 
enumerate all of the outcomes that could occur in some reference set besides the 
outcome that was actually observed. Then you order these outcomes by some measure 
of discrepancy that reflects deviation from the null hypothesis. The exact p value is the 
sum of exact probabilities of those outcomes in the reference set that are at least as 
extreme as the one actually observed. 

Enumeration of all of the tables in a reference set can be computationally intensive. 
For example, the reference set of all 5 x 6 tables of the form


4 5 6 5 7 7 34


contains 1.6 billion tables, which presents a challenging computational problem. Fortu- 
nately, two developments have made exact p value computations practically feasible. 
First, the computer revolution has dramatically redefined what is computationally do- 
able and affordable. Second, many new fast and efficient computational algorithms have 
been published over the last decade. Thus, problems that would have taken several hours 
or even days to solve now take only a few minutes. 

It is useful to have some idea about how the algorithms in Exact Tests work. There are 
two basic types of algorithms: complete enumeration and Monte Carlo enumeration. The 
complete enumeration algorithms enumerate every single outcome in the reference set. 
Thus they always produce the exact p value. Their result is essentially 100% accurate. 
They are not, however, guaranteed to solve every problem. Some data sets might be too 
large for complete enumeration of the reference set within given time and machine limits. 
For this reason, Monte Carlo enumeration algorithms are also provided. These algorithms 
enumerate a random subset of all the possible outcomes in the reference set. The Monte 
Carlo algorithms provide an estimate of the exact p value, called the Monte Carlo p value, 
which can be made as accurate as necessary for the problem at hand. Typically, their re- 
sult is 99% accurate, but you are free to increase the level of accuracy to any arbitrary 
degree simply by sampling more outcomes from the reference set. Also, they are guaran- 
teed to solve any problem, no matter how large the data set. Thus, they provide a robust, 
reliable back-up for the situations in which the complete enumeration algorithms fail. Fi- 
nally, the asymptotic p value is always available by default. 

General guidelines for when to use the exact, Monte Carlo, or asymptotic p values 
include the following: 


• It is wise to never report an asymptotic p value without first checking its accuracy
  against the corresponding exact or Monte Carlo p value. You cannot easily predict a
  priori when the asymptotic p value will be sufficiently accurate.


• The choice of exact versus Monte Carlo is largely one of convenience. The time
  required for the exact computations is less predictable than for the Monte Carlo
  computations. Usually, the exact computations either produce a quick answer, or
  else they quickly terminate with the message that the problem is too hard for the
  exact algorithms. Sometimes, however, the exact computations can take several
  hours, in which case it is better to interrupt them by selecting Stop Processor from
  the File menu and repeating the analysis with the Monte Carlo option. The Monte
  Carlo p values are for most practical purposes just as good as the exact p values.
  The method has the additional advantage that it takes a predictable amount of time,
  and an answer is available at any desired level of accuracy.


• Exact Tests makes it very easy to move back and forth between the exact and Monte
  Carlo options. So feel free to experiment.


The following sections discuss the exact, Monte Carlo, and asymptotic p values in 
greater detail. 


When to Use Exact P Values 


Ideally you would use exact p values all of the time. They are, after all, the gold
standard. Only by deciding to accept or reject the null hypothesis on the basis of an exact p
value are you guaranteed to be protected from type I errors at the desired significance
level. In practice, however, it is not possible to use exact p values all of the time. The 
algorithms in Exact Tests might break down as the size of the data set increases. It is dif- 
ficult to quantify just how large a data set can be solved by the exact algorithms, because 
that depends on so many factors other than just the sample size. You can sometimes 
compute an exact p value for a data set whose sample size is over 20,000, and at other 
times fail to compute an exact p value for a data set whose sample size is less than 30. 
The type of exact test desired, the degree of imbalance in the allocation of subjects to 
treatments, the number of rows and columns in a crosstabulation, the number of ties in 
the data, and a variety of other factors interact in complicated ways to determine if a par- 
ticular data set is amenable to exact inference. It is thus a very difficult task to specify 
the precise upper limits of computational feasibility for the exact algorithms. It is more 
useful to specify sample size and table dimension ranges within which the exact algo- 
rithms will produce quick answers—that is, within a few seconds. Table 1.1 and Table 
1.2 describe the conditions under which exact tests can be computed quickly. In general, 
almost every exact test in Exact Tests can be executed in just a few seconds, provided 
the sample size does not exceed 30. The Kruskal-Wallis test, the runs tests, and tests on 
the Pearson and Spearman correlation coefficients are exceptions to this general rule. 
They require a smaller sample size to produce quick answers. 


When to Use Monte Carlo P Values 


Many data sets are too large for the exact p value computations, yet too sparse or 
unbalanced for the asymptotic results to be reliable. Figure 2.10 is an example of such 
a data set, taken from Senchaudhuri, Mehta, and Patel (1995). This data set reports the 
thickness of the left ventricular wall, measured by echocardiography, in 947 athletes 
participating in 25 different sports in Italy. There were 16 athletes with a wall 
thickness of ≥ 13 mm, which is indicative of hypertrophic cardiomyopathy. The
objective is to determine whether there is any correlation between presence of this
condition and the type of sports activity.


Figure 2.10 Left ventricular wall thickness versus sports activity


Count
                        Left Ventricular Wall Thickness
SPORT                   >= 13 mm    < 13 mm    Total
Weightlifting               1           6         7
Field wt. events                        9         9
Wrestling/Judo                         16        16
Tae kwon do                 1          16        17
Roller Hockey               1          22        23
Team Handball               1          25        26
Cross-coun. skiing          1          30        31
Alpine Skiing                          32        32
Pentathlon                             50        50
Roller Skating                         58        58
Equestrianism                          28        28
Bobsledding                 1          15        16
Volleyball                             51        51
Diving                      1          10        11
Boxing                                 14        14
Cycling                     1          63        64
Water Polo                             21        21
Yachting                               24        24
Canoeing                    3          57        60
Fencing                     1          41        42
Tennis                                 47        47
Rowing                      4          91        95
Swimming                               54        54
Soccer                                 62        62
Track                                  89        89


You can obtain the results of the likelihood-ratio statistic for this 25 x 2 contingency
table with the Crosstabs procedure. The results are shown in Figure 2.11.


Figure 2.11 Likelihood ratio for left ventricular wall thickness versus sports activity data 


Chi-Square Tests 


                                  Asymp. Sig.
                   Value     df   (2-tailed)
Likelihood Ratio   32.495    24      .115


The value of this statistic is 32.495. The asymptotic p value, based on the likelihood- 
ratio test, is therefore the tail area to the right of 32.495 from a chi-square distribution 
with 24 degrees of freedom. The reported p value is 0.115. But notice how sparse and 
unbalanced this table is. This suggests that you ought not to rely on the asymptotic p 
value. Ideally, you would like to enumerate every single 25 x 2 contingency table with 
the same row and column margins as those in Figure 2.10, identify tables that are more 
extreme than the observed table under the null hypothesis, and thereby obtain the exact 
p value. This is a job for Exact Tests. However, when you try to obtain the exact 
likelihood-ratio p value in this manner, Exact Tests gives the message that the problem 
is too large for the exact option. Therefore, the next step is to use the Monte Carlo 
option. The Monte Carlo option can generate an extremely accurate estimate of the exact 
p value by sampling 25 x 2 tables from the reference set of all tables with the observed 
margins a large number of times. The default is 10,000 times, but this can easily be 
changed in the dialog box. Provided each table is sampled in proportion to its 
hypergeometric probability (see Equation 2.4), the fraction of sampled tables that are at 
least as extreme as the observed table gives an unbiased estimate of the exact p value. 
That is, if M tables are sampled from the reference set, and Q of them are at least as
extreme as the observed table (in the sense of having a likelihood-ratio statistic greater
than or equal to 32.495), the Monte Carlo estimate of the exact p value is


$$\hat{p} = \frac{Q}{M}$$          Equation 2.7


The variance of this estimate is obtained by straightforward binomial theory to be:


$$\mathrm{var}(\hat{p}) = \frac{p(1 - p)}{M}$$          Equation 2.8


Thus, a 100 × (1 − γ)% confidence interval for p is


$$\mathrm{CI} = \hat{p} \pm z_{\gamma/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{M}}$$          Equation 2.9


where z_α is the αth percentile of the standard normal distribution. For example, if you
wanted a 99% confidence interval for p, you would use z_0.005 = −2.576. This is the
default in Exact Tests, but it can be changed in the dialog box. The Monte Carlo results for
these data are shown in Figure 2.12.


Figure 2.12 Monte Carlo results for left ventricular wall thickness versus sports activity data


Chi-Square Tests


                                                Monte Carlo Significance (2-tailed)
                                                          99% Confidence Interval
                                  Asymp. Sig.               Lower      Upper
                   Value     df   (2-tailed)    Sig.        Bound      Bound
Likelihood Ratio   32.495    24      .115       .044 (2)    .039       .050


2. Based on 10000 and seed 2000000 ...
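
The sampling scheme behind Equation 2.7 can be sketched in Python for the data in Figure 2.10. This is a hedged illustration, not the sampler used by Exact Tests: the function names are hypothetical, blank cells in the figure are read as zero, and Python's random number generator differs from the one in Exact Tests, so the seed 2000000 is not expected to reproduce the reported estimate digit for digit. Drawing the 16 affected athletes at random from all 947 generates tables with the observed margins in proportion to their hypergeometric probabilities.

```python
import random
from bisect import bisect_right
from itertools import accumulate
from math import exp, log

# Row totals and counts with wall thickness >= 13 mm, in the order of Figure 2.10
totals = [7, 9, 16, 17, 23, 26, 31, 32, 50, 58, 28, 16, 51,
          11, 14, 64, 21, 24, 60, 42, 47, 95, 54, 62, 89]
thick = [1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0,
         1, 0, 1, 0, 0, 3, 1, 0, 4, 0, 0, 0]
N, K = sum(totals), sum(thick)           # 947 athletes, 16 affected
bounds = list(accumulate(totals))        # maps an athlete index to a sport

def g2(counts):
    """Likelihood-ratio statistic of the 25 x 2 table with the given first column."""
    stat = 0.0
    for a, n in zip(counts, totals):
        for obs, expected in ((a, n * K / N), (n - a, n * (N - K) / N)):
            if obs > 0:
                stat += obs * log(obs / expected)
    return 2 * stat

def sample_counts(rng):
    """Draw one table from the reference set with hypergeometric probability."""
    counts = [0] * len(totals)
    for athlete in rng.sample(range(N), K):
        counts[bisect_right(bounds, athlete)] += 1
    return counts

def monte_carlo_p(m=10_000, seed=2000000):
    """Equation 2.7: fraction of sampled tables at least as extreme as observed."""
    rng = random.Random(seed)
    observed = g2(thick)
    q = sum(g2(sample_counts(rng)) >= observed - 1e-9 for _ in range(m))
    return q / m

def chi2_sf_even_df(x, df):
    """Upper tail of the chi-square distribution; closed form valid for even df."""
    half, term, total = x / 2, 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return exp(-half) * total

g2_observed = g2(thick)
p_asymptotic = chi2_sf_even_df(g2_observed, 24)
p_hat = monte_carlo_p()
print(round(g2_observed, 3), round(p_asymptotic, 3), p_hat)
```

The estimate varies slightly with the seed and the generator, but it should fall close to the interval reported in Figure 2.12 and well below the asymptotic value.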


The Monte Carlo estimate of 0.044 for the exact p value is based on 10,000 random 
samples from the reference set, using a starting seed of 2000000. Exact Tests also 
computes a 99% confidence interval for the exact p value. This confidence interval is 
(0.039, 0.050). You can be 99% sure that the true p value is within this interval. The 
width can be narrowed even further by sampling more tables from the reference set. That 
will reduce the variance (see Equation 2.8) and hence reduce the width of the confidence
interval (see Equation 2.9). It is a simple matter to sample 50,000 times from the
reference set instead of only 10,000 times. These results are shown in Figure 2.13.


Figure 2.13 Monte Carlo results with sample size of 50,000 


Chi-Square Tests


                                                Monte Carlo Significance (2-tailed)
                                                          99% Confidence Interval
                                  Asymp. Sig.               Lower      Upper
                   Value     df   (2-tailed)    Sig.        Bound      Bound
Likelihood Ratio   32.495    24      .115       .045 (2)    .043       .047


2. Based on 50000 and seed 2000000 ...


With a sample of size 50,000 and the same starting seed, 2000000, you obtain 0.045 as 
the Monte Carlo estimate of p. Now the 99% confidence interval for p is (0.043, 0.047). 
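
The effect of the sample size on the width of the interval can be checked by applying Equation 2.9 to the two reported point estimates. The sketch below is illustrative only; the function name is hypothetical, and because it starts from the point estimates as rounded in the text, its bounds can differ from the reported ones in the third decimal place.

```python
from math import sqrt

Z_99 = 2.576     # magnitude of z_0.005, as quoted in the text

def monte_carlo_ci(p_hat, m, z=Z_99):
    """Normal-approximation confidence interval for the exact p value (Equation 2.9)."""
    half_width = z * sqrt(p_hat * (1 - p_hat) / m)
    return p_hat - half_width, p_hat + half_width

lo_10k, hi_10k = monte_carlo_ci(0.044, 10_000)
lo_50k, hi_50k = monte_carlo_ci(0.045, 50_000)

print(round(lo_10k, 4), round(hi_10k, 4))
print(round(lo_50k, 4), round(hi_50k, 4))
```

Because the half-width is proportional to 1/sqrt(M), sampling five times as many tables narrows the interval by a factor of roughly sqrt(5), or about 2.2.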

How good are the Monte Carlo estimates? Why would you use them rather than the
asymptotic p value of 0.115? There are several major advantages to using the Monte
Carlo method as opposed to using the asymptotic p value for inference.


1. The Monte Carlo estimate is unbiased. That is, E(p̂) = p.


2. The Monte Carlo estimate is accompanied by a confidence interval within which the
   exact p value is guaranteed to lie at the specified confidence level. The asymptotic p
   value is not accompanied by any such probabilistic guarantee.


3. The width of the confidence interval can be made arbitrarily small, by sampling more
   tables from the reference set.


4. In principle, you could narrow the width of the confidence interval to such an extent
   that the Monte Carlo p value becomes indistinguishable from the exact p value up to
   say the first three decimal places. For all practical purposes, you could then claim to
   have the exact p value. Of course, this might take a few hours to accomplish.


5. In practice, you don’t need to go quite so far. Simply knowing that the upper bound
   of the confidence interval is below 0.05, or that the lower bound of the confidence
   interval is above 0.05 is satisfying. Facts like these can usually be quickly established
   by sampling about 10,000 tables, and this takes only a few seconds.


6. The asymptotic p value carries no probabilistic guarantee whatsoever as to its
   accuracy. In the present example, the asymptotic p value is 0.115, implying, incorrectly,
   that there is no interaction between the ventricular wall thickness and the sports
   activity. The Monte Carlo estimate on the other hand does indeed establish this
   relationship at the 5% significance level.


To summarize: 


• The Monte Carlo option with a sample of size 10,000 and a confidence level of 99%
  is the default in Exact Tests. At these default values, the Monte Carlo option provides
  very accurate estimates of exact p values in just a few seconds. These defaults can be
  easily changed in the Monte Carlo dialog box.


• Users will find that even when the width of the Monte Carlo confidence interval is
  wider than they’d like, the point estimate itself is very close to the exact p value.
  For the fire fighters data discussed in “Pearson Chi-Square Test for a 3 x 4 Table”
  on p. 15, the Monte Carlo estimate of the exact p value for the Pearson chi-square
  test is shown in Figure 2.14.


Figure 2.14 Monte Carlo results of Pearson chi-square test for fire fighter data


Chi-Square Tests


                                                  Monte Carlo Significance (2-tailed)
                                                            99% Confidence Interval
                                    Asymp. Sig.               Lower      Upper
                     Value      df  (2-tailed)    Sig.        Bound      Bound
Pearson Chi-Square   11.556 (1)  6     .073       .041 (2)


1. 12 cells (100.0%) have expected count less than 5. The minimum expected count is .50.
2. Based on 10000 and seed 2000000 ...


The result, based on 10,000 observations and a starting seed of 2000000, is 0.041. This is 
much closer to the exact p value for the Pearson test, 0.040, than the asymptotic p value, 
0.073. As an exercise, run the Monte Carlo version of the Pearson test on this data set a few 
times with different starting seeds. You will observe that the Monte Carlo estimate changes 
slightly from run to run, because you are using a different starting seed each time. However, 
you will also observe that each Monte Carlo estimate is very close to the exact p value. 
Thus, even if you ignored the information in the confidence interval, the Monte Carlo point 
estimate itself is often good enough for routine use. For a more refined analysis, you may 
prefer to report both the point estimate and the confidence interval. 


• If you want to replicate someone else’s Monte Carlo results, you need to know the
  starting seed used previously. Exact Tests reports the starting seed each time you run
  a test. If you don’t specify your own starting seed, Exact Tests provides one. See
  “How to Set the Random Number Seed” on p. 9 in Chapter 1 for information on
  setting the random number seed.


When to Use Asymptotic P Values 


Although the exact p value can be shown to converge mathematically to the 
corresponding asymptotic p value as the sample size becomes infinitely large, this 
property is not of much practical value in guaranteeing the accuracy of the asymptotic p 
value for any specific data set. There are many different data configurations where the 
asymptotic methods perform poorly. These include small data sets, data sets containing 
ties, large but unbalanced data sets, and sparse data sets. A numerical example follows 
for each of these situations.

Small Data Sets. The data set shown in Figure 2.15 consists of the first 7 pairs of
observations of the authoritarianism versus social status striving data discussed in Siegel and
Castellan (1988).


Figure 2.15 Subset of authoritarianism versus social status striving data 


subject   author   social
   1        82       42
   2        98       46
   3        87       39
   4        40       37
   5       116       65
   6       113       88
   7       111       86


Pearson’s product-moment correlation coefficient computed from this sample is 0.7388. 
This result is shown in Figure 2.16. 


Figure 2.16 Pearson’s product-moment correlation coefficient for social status striving data 


Symmetric Measures


                                               Asymp.                    Approx.     Exact
                                      Value    Std. Error   Approx. T    Sig.        Significance
Interval by Interval   Pearson's R    .739       .054         2.452      .058 (1)      .037


1. Based on normal approximation


Suppose that you wanted to test the null hypothesis that these data arose from a population 
in which the underlying Pearson’s product-moment correlation coefficient is 0, against the 
one-sided alternative that authoritarianism and social status striving are positively corre- 
lated. Using the techniques described in Chapter 1, you see that the asymptotic two-sided 
p value is 0.058. In contrast, the exact one-sided p value is 0.037. You can conclude that 
the asymptotic method does not perform well in this small data set. 
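
With only 7 pairs, the permutation distribution of the correlation coefficient has 7! = 5,040 points and can be enumerated directly. The Python sketch below is an illustration under stated assumptions, not the Exact Tests algorithm: the names are hypothetical, and the fourth social score is taken to be 37, a reading that reproduces the reported coefficient of 0.7388. It prints both a one-sided and a two-sided permutation p value so that they can be compared with Figure 2.16.

```python
from itertools import permutations
from math import sqrt

author = [82, 98, 87, 40, 116, 113, 111]
social = [42, 46, 39, 37, 65, 88, 86]    # fourth value assumed to be 37

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

r_obs = pearson_r(author, social)

# Under the null hypothesis every pairing of the two score lists is equally likely
perm_rs = [pearson_r(author, perm) for perm in permutations(social)]

eps = 1e-12    # guards against floating-point ties
p_one_sided = sum(r >= r_obs - eps for r in perm_rs) / len(perm_rs)
p_two_sided = sum(abs(r) >= abs(r_obs) - eps for r in perm_rs) / len(perm_rs)

print(round(r_obs, 4), round(p_one_sided, 3), round(p_two_sided, 3))
```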

Data With Ties. The diastolic blood pressure (mm Hg) was measured on 6 subjects in a
treatment group and 7 subjects in a control group. The data are shown in Figure 2.17.


Figure 2.17 Diastolic blood pressure of treated and control groups 


pressure   group
   94      Treated
  108      Treated
  110      Treated
   90      Treated
  108      Treated
  105      Treated
   80      Control
   94      Control
   94      Control
   90      Control
   90      Control
   94      Control
   94      Control


The results of the two-sample Kolmogorov-Smirnov test for these data are shown in
Figure 2.18.


Figure 2.18 Two-sample Kolmogorov-Smirnov test results for diastolic blood pressure data 


Frequencies


                                      N
Diastolic Blood   GROUP   Treated     6
Pressure                  Control     7
                          Total      13


Test Statistics (1)

                                         Diastolic Blood Pressure
Most Extreme Differences   Absolute              .667
                           Positive              .667
                           Negative              .000
Kolmogorov-Smirnov Z                            1.198
Asymp. Sig. (2-tailed)                           .113
Exact Significance (2-tailed)                    .042
Point Probability


1. Grouping Variable: GROUP


The asymptotic two-sided p value is 0.113. In contrast, the exact two-sided p value is
0.042, less than half the asymptotic result. The poor performance of the asymptotic test
is attributable to the large number of tied observations in this data set. Suppose, for
example, that the data were free of any ties, as shown in Figure 2.19.


Figure 2.19 Diastolic blood pressure of treated and control groups, without ties 



The two-sample Kolmogorov-Smirnov results for these data, without ties, are shown in
Figure 2.20.


Figure 2.20 Two-sample Kolmogorov-Smirnov test results for diastolic blood pressure data, 
without ties 


Frequencies


                                      N
Diastolic Blood   GROUP   Treated     6
Pressure                  Control     7
                          Total      13


Test Statistics (1)

                                         Diastolic Blood Pressure
Most Extreme Differences   Absolute              .667
                           Positive              .667
                           Negative              .000
Kolmogorov-Smirnov Z                            1.198
Asymp. Sig. (2-tailed)                           .113
Exact Significance (2-tailed)                    .042
Point Probability                                .042


1. Grouping Variable: GROUP


The asymptotic Kolmogorov-Smirnov two-sided p value remains unchanged at 0.113.
This time, however, it is much closer to the exact two-sided p value, which is 0.091.
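
For the tied data of Figure 2.17, the exact two-sided p value can be reproduced by enumerating every way of assigning 6 of the 13 pooled observations to the treated group. The Python sketch below is illustrative only and is not the algorithm used by Exact Tests; the function and variable names are hypothetical.

```python
from itertools import combinations
from math import sqrt

treated = [94, 108, 110, 90, 108, 105]
control = [80, 94, 94, 90, 90, 94, 94]

def ks_statistic(a, b):
    """Largest absolute gap between the two empirical distribution functions."""
    return max(
        abs(sum(v <= t for v in a) / len(a) - sum(v <= t for v in b) / len(b))
        for t in set(a) | set(b)
    )

pooled = treated + control
d_obs = ks_statistic(treated, control)

# Kolmogorov-Smirnov Z, the scaled statistic used for the asymptotic p value
z_obs = d_obs * sqrt(len(treated) * len(control) / len(pooled))

# Under the null hypothesis every choice of 6 treated subjects is equally likely
hits = total = 0
for idx in combinations(range(len(pooled)), len(treated)):
    chosen = set(idx)
    a = [pooled[i] for i in chosen]
    b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
    total += 1
    hits += ks_statistic(a, b) >= d_obs - 1e-12

p_exact = hits / total
print(round(d_obs, 3), round(z_obs, 3), hits, total, round(p_exact, 3))
```

Because so many observations are tied, only a handful of distinct values of the statistic are possible, which is why the permutation distribution is poorly approximated by its continuous limit.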


Large but Unbalanced Data Sets


Data from a prospective study of maternal drinking and congenital sex organ malforma- 
tions (Graubard and Korn, 1987) are shown in Figure 2.21 in the form of a 2 x 5 con- 
tingency table. 


Figure 2.21 Alcohol during pregnancy and birth defects 


Malformation * Maternal Alcohol Consumption (drinks/day) Crosstabulation 


Count 
Maternal Alcohol Consumption (drinks/day) 
0 <1 1-2 3-5 >=6 
Malformation | Absent 17066 14464 788 126 37 
Present 48 38 5 1 1 


The linear-by-linear association test may be used to determine if there is a dose-response re- 
lationship between the average number of drinks consumed each day during pregnancy, and 
the presence of a congenital sex organ malformation. The results are shown in Figure 2.22. 


Figure 2.22 Results of linear-by-linear association test for maternal drinking data 


Chi-Square Tests


                                      Asymp. Sig.   Exact Sig.   Exact Sig.   Point
                     Value       df   (2-tailed)    (2-tailed)   (1-tailed)   Probability
Linear-by-Linear
Association          1.828 (2)    1      .176          .179         .105         .028


2. Standardized stat. is 1.352 ...


The asymptotic two-sided p value is 0.176. In contrast, the two-sided exact p value is 
0.179. 
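
The linear-by-linear statistic and its asymptotic p value can be checked from the counts in Figure 2.21. The sketch below is a minimal illustration, not the Crosstabs implementation; it assumes equally spaced scores 0 to 4 for the five consumption categories and scores 0 and 1 for absent and present, and uses the form (N − 1)r², where r is the correlation between the row and column scores.

```python
from math import erfc, sqrt

scores = [0, 1, 2, 3, 4]                     # assumed equally spaced column scores
absent = [17066, 14464, 788, 126, 37]
present = [48, 38, 5, 1, 1]

col_totals = [a + p for a, p in zip(absent, present)]
n = sum(col_totals)

# Sums of squares and cross-products of the column score (u) and row score (v)
sum_u = sum(u * c for u, c in zip(scores, col_totals))
sum_uu = sum(u * u * c for u, c in zip(scores, col_totals))
sum_v = sum(present)                         # v is 1 for present, 0 for absent
sum_uv = sum(u * p for u, p in zip(scores, present))

s_uv = sum_uv - sum_u * sum_v / n
s_uu = sum_uu - sum_u ** 2 / n
s_vv = sum_v - sum_v ** 2 / n                # v squared equals v for 0/1 scores

r_squared = s_uv ** 2 / (s_uu * s_vv)
m2 = (n - 1) * r_squared                     # linear-by-linear statistic, 1 df
standardized = sqrt(m2)
p_asymptotic = erfc(sqrt(m2 / 2))            # upper tail of chi-square with 1 df

print(round(m2, 3), round(standardized, 3), round(p_asymptotic, 3))
```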


Sparse Data Sets 


Data were gathered from 250 college and university administrators on various indicators 
of performance like the number of applications for admittance, student/faculty ratio, 
faculty salaries, average SAT scores, funding available for inter-collegiate sports, and so 
forth. Figure 2.23 shows a crosstabulation of competitiveness against the student/faculty 
ratio for a subset consisting of the 65 state universities that participated in the survey. 


Figure 2.23 Student/faculty ratio versus competitiveness of state universities 


Student/Faculty Ratio * Competitiveness of Institution Crosstabulation


Count; rows: Student/Faculty Ratio; columns: Competitiveness of Institution (Average, Very, Highly, Total); table total: 65


Figure 2.24 shows the asymptotic and Monte Carlo results of the Pearson chi-square test
for these data.


Figure 2.24 Monte Carlo results for student/faculty ratio vs. competitiveness data 


Chi-Square Tests 


                                                  Monte Carlo Significance (2-tailed)
                                                            99% Confidence Interval
                                    Asymp. Sig.               Lower      Upper
                     Value      df  (2-tailed)    Sig.        Bound      Bound
Pearson Chi-Square   94.424 (1) 72     .039       .114 (2)    .106       .122


1. 95 cells (100.0%) have expected count less than 5. The minimum expected count is .02. 
2. Based on 10000 and seed 2000000 ... 


The asymptotic p value based on the Pearson chi-square test is 0.039, suggesting that 
there is an interaction between competitiveness and the student/faculty ratio. Notice, 
however, that the table, though large, is very sparse. Because this data set is so large, the 
Monte Carlo result, rather than the exact result, is shown. The Monte Carlo estimate of 
the exact p value is 0.114. This is a three-fold increase in the p value, which suggests 
that there is, after all, no interaction between competitiveness and the student/faculty 
ratio at state universities. 

It should be clear from the above examples that it is very difficult to predict a priori if 
a given data set is large enough to rely on an asymptotic approximation to the p value. The 
notion of what constitutes a large sample depends on the structure of the data and the test 
being used. It cannot be characterized by any single measure. A crosstabulation created 
from several thousand observations might nevertheless produce inaccurate asymptotic p 
values if it possesses many cells with small counts. On the other hand, a rank test like the 
Wilcoxon, performed on continuous, well-balanced data, with no ties, could produce an 
accurate asymptotic p value with a sample size as low as 20. Ultimately, the best 
definition of a large data set is an operational one—if a data set produces an accurate 
asymptotic p value, it is large; otherwise, it is small. In the past, such a definition would 
have been meaningless, since there was no gold standard by which to gauge the accuracy 
of the asymptotic p value. In Exact Tests, however, either the exact p value or its Monte 
Carlo estimate is readily available to make the comparison and may be used routinely 
along with the asymptotic p value. 
One-Sample Goodness-of-Fit 
Inference 


This chapter discusses tests used to determine how well a data set is fitted by a specified 
distribution. Such tests are known as goodness-of-fit tests. Exact Tests computes exact 
and asymptotic p values for the chi-square and Kolmogorov-Smirnov tests. 


Available Tests 


Table 3.1 shows the goodness-of-fit tests available in Exact Tests, the procedure from 
which each can be obtained, and a bibliographical reference for each. 


Table 3.1 Available tests 


Test                  Procedure                             Reference

Chi-square            Nonparametric Tests: Chi-square       Siegel and Castellan (1988)
Kolmogorov-Smirnov    Nonparametric Tests: 1 Sample K-S     Conover (1980)


Chi-Square Goodness-of-Fit Test 


The chi-square goodness-of-fit test is applicable either to categorical data or to
continuous data that have been pre-grouped into a discrete number of categories. In
tabular form, the data are organized as a 1 x c contingency table, where c is the number
of categories. Cell i of this 1 x c table contains a frequency count, $O_i$, of the number
of observations falling into category i. Along the bottom of the table is a (1 x c) vector
of cell probabilities


$\pi = (\pi_1, \pi_2, \ldots, \pi_c)$    Equation 3.1


such that $\pi_i$ is associated with column i. This representation is shown in Table 3.2.

Table 3.2 Frequency counts for chi-square goodness-of-fit test


                      Multinomial Categories                  Row
                      1         2         ...     c           Total
Cell Counts           $O_1$     $O_2$     ...     $O_c$       N
Cell Probabilities    $\pi_1$   $\pi_2$   ...     $\pi_c$     1


The chi-square goodness-of-fit test is used to determine if the data arose by taking N
independent samples from a multinomial distribution consisting of c categories with cell
probabilities given by $\pi$. The null hypothesis


$H_0: (O_1, O_2, \ldots, O_c) \sim \mathrm{Multinomial}(\pi, N)$    Equation 3.2


can be tested versus the general alternative that $H_0$ is not true. The test statistic
for the test is


$X^2 = \sum_{i=1}^{c} (O_i - E_i)^2 / E_i$    Equation 3.3


where $E_i = N\pi_i$ is the expected count in cell i. High values of $X^2$ indicate lack
of fit and lead to rejection of $H_0$. If $H_0$ is true, asymptotically, as $N \to \infty$,
the random variable $X^2$ converges in distribution to a chi-square distribution with
(c - 1) degrees of freedom. The asymptotic p value is, therefore, given by the right tail
of this distribution. Thus, if $x^2$ is the observed value of the test statistic $X^2$,
the asymptotic two-sided p value is given by


$\tilde{p}_2 = \Pr(\chi^2_{c-1} \ge x^2)$    Equation 3.4


The asymptotic approximation may not be reliable when the $E_i$'s are small. For example,
Siegel and Castellan (1988) suggest that one can safely use the approximation only
if at least 80% of the $E_i$'s equal or exceed 5 (that is, no more than 20% fall below 5)
and none of the $E_i$'s are less than 1. In cases where the asymptotic approximation is
suspect, the usual procedure has been to collapse categories to meet criteria such as
those suggested by Siegel and Castellan. However, this introduces subjectivity into the
analysis, since differing p values can be obtained by using different collapsing schemes.
Exact Tests gives the exact p values without making any assumptions about the $\pi_i$'s
or N.

The exact p value is computed in Exact Tests by generating the true distribution of
$X^2$ under $H_0$. Since there is no approximation, there is no need to collapse
categories, and the natural categories for the data can be maintained. Thus, the exact
two-sided p value is given by


$p_2 = \Pr(X^2 \ge x^2)$    Equation 3.5


Sometimes a data set is too large for the exact p value to be computed, yet there might
be reasons why the asymptotic p value is not sufficiently accurate. For these situations,
Exact Tests provides a Monte Carlo estimate of the exact p value. This estimate is
obtained by generating M multinomial vectors from the null distribution and counting how
many of them result in a test statistic whose value equals or exceeds $x^2$, the test
statistic actually observed. Suppose that this number is m. If so, a Monte Carlo estimate
of $p_2$ is


$\hat{p}_2 = m/M$    Equation 3.6


A 99% confidence interval for $p_2$ is then obtained by standard binomial theory as


$CI = \hat{p}_2 \pm 2.576 \sqrt{\hat{p}_2 (1 - \hat{p}_2)/M}$    Equation 3.7


A technical difficulty arises when either $\hat{p}_2 = 0$ or $\hat{p}_2 = 1$. Now the
sample standard deviation is 0, but the data do not support a confidence interval of zero
width. An alternative way to compute a confidence interval that does not depend on the
sample standard deviation is based on inverting an exact binomial hypothesis test when an
extreme outcome is encountered. If $\hat{p}_2 = 0$, an $\alpha$% confidence interval for
the exact p value is

$CI = [0, 1 - (1 - \alpha/100)^{1/M}]$    Equation 3.8


Similarly, when $\hat{p}_2 = 1$, an $\alpha$% confidence interval for the exact p value is

$CI = [(1 - \alpha/100)^{1/M}, 1]$    Equation 3.9

Exact Tests uses default values of M = 10000 and $\alpha$ = 99%. While these defaults can
be easily changed, they provide quick and accurate estimates of exact p values for a wide
range of data sets.
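
The interval rules in Equations 3.6 through 3.9 can be sketched in a few lines of Python.
This is only an illustration of the formulas, not the code used by Exact Tests; the
function name monte_carlo_ci and its arguments are hypothetical, and the normal quantile
2.576 is hard-wired for the default 99% level.

```python
import math

def monte_carlo_ci(m, M, level=99.0, z=2.576):
    """Return (estimate, lower, upper) for the exact p value.

    m     -- number of sampled tables whose statistic equals or exceeds the observed one
    M     -- total number of sampled tables
    level -- confidence level in percent (alpha in Equations 3.8 and 3.9)
    z     -- normal quantile matching `level` (2.576 corresponds to 99%)
    """
    p_hat = m / M                                     # Equation 3.6
    tail = (1.0 - level / 100.0) ** (1.0 / M)
    if m == 0:                                        # Equation 3.8
        return p_hat, 0.0, 1.0 - tail
    if m == M:                                        # Equation 3.9
        return p_hat, tail, 1.0
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / M)   # Equation 3.7
    return p_hat, p_hat - half, p_hat + half
```

With M = 10000 and no sampled table reaching the observed statistic, the upper bound from
Equation 3.8 is about 0.00046 rather than 0, which is the point of the special case.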
Example: A Small Data Set 


Table 3.3 shows the observed counts and the multinomial probabilities under the null 
hypothesis for a multinomial distribution with four categories. 


Table 3.3 Frequency counts from a multinomial distribution with four categories


                      Multinomial Categories          Row
                      1       2       3       4       Total
Cell Counts           7       1       1       1       10
Cell Probabilities    0.3     0.3     0.3     0.1     1


The results of the exact chi-square goodness-of-fit test are shown in Figure 3.1.


Figure 3.1 Chi-square goodness-of-fit results

CATEGORY

Observed N    Expected N    Residual
7             3.0           4.0
1             3.0           -2.0
1             3.0           -2.0
1             1.0           .0


Test Statistics

            Chi-Square    df    Asymp. Sig.    Exact Sig.    Point Probability
CATEGORY    8.000         3     .046           .052          .020

1. 4 cells (100.0%) have expected frequencies less than 5. The minimum
expected cell frequency is 1.0.


The value of the chi-square goodness-of-fit statistic is 8.0. Referring this value to a
chi-square distribution with 3 degrees of freedom yields an asymptotic p value


$\tilde{p}_2 = \Pr(\chi^2_3 \ge 8.0) = 0.046$


However, there are many cells with small counts in the observed 1 x 4 contingency
table. Thus, the asymptotic approximation is not reliable. In fact, the exact p value is


$p_2 = \Pr(X^2 \ge 8.0) = 0.0523$

Exact Tests also provides the point probability that the test statistic equals 8.0. This
probability, 0.0203, is a measure of the discreteness of the exact distribution of $X^2$.
Some statisticians advocate subtracting half of this point probability from the exact p
value, and call the result the mid-p value.
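
For a table this small, the exact p value and the point probability can be checked by
brute force. The sketch below enumerates every 1 x 4 table with N = 10, weights each by
its multinomial probability, and sums the probabilities of tables whose statistic equals
or exceeds the observed one. It is a plain enumeration written for illustration, not the
algorithm used by Exact Tests; the helper names and the tolerance tol (used to treat
statistics that differ only by rounding error as ties) are assumptions.

```python
from math import factorial

def compositions(total, parts):
    """Yield every tuple of `parts` non-negative integers that sums to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest

def multinomial_prob(counts, probs):
    """Probability of `counts` under Multinomial(probs, sum(counts))."""
    coef = factorial(sum(counts))
    prob = 1.0
    for o, p in zip(counts, probs):
        coef //= factorial(o)      # stays an exact integer at every step
        prob *= p ** o
    return coef * prob

def chi_square(counts, probs):
    """Equation 3.3 with expected counts N * pi_i."""
    n = sum(counts)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(counts, probs))

def exact_gof(observed, probs, tol=1e-9):
    """Return (exact p value, point probability) for the chi-square statistic."""
    x2 = chi_square(observed, probs)
    p_value = point = 0.0
    for table in compositions(sum(observed), len(observed)):
        stat = chi_square(table, probs)
        if stat >= x2 - tol:
            pr = multinomial_prob(table, probs)
            p_value += pr
            if abs(stat - x2) <= tol:
                point += pr
    return p_value, point

p_value, point = exact_gof((7, 1, 1, 1), (0.3, 0.3, 0.3, 0.1))
mid_p = p_value - point / 2
print(round(p_value, 4), round(point, 4), round(mid_p, 4))
```

The first two printed values should agree with the exact p value and point probability
quoted above; the third is the mid-p value obtained by subtracting half the point
probability.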

Because of its small size, this data set does not require a Monte Carlo analysis. 
However, results obtained from a Monte Carlo analysis are more accurate than results 
produced by an asymptotic analysis. Figure 3.2 shows the Monte Carlo estimate of the 
exact p value based on a Monte Carlo sample of 10,000. 


Figure 3.2 Monte Carlo results for chi-square test


CATEGORY

Observed N    Expected N    Residual
7             3.0           4.0
1             3.0           -2.0
1             3.0           -2.0
1             1.0           .0
Total 10


Test Statistics

                                                Monte Carlo Sig.
                                                         99% Confidence Interval
            Chi-Square    df    Asymp. Sig.     Sig.     Lower Bound    Upper Bound
CATEGORY    8.000         3     .046            .049     .044           .055

1. 4 cells (100.0%) have expected frequencies less than 5. The minimum expected cell
frequency is 1.0.

2. Based on 10000 sampled tables with starting seed 2000000.


The Monte Carlo estimate of the exact p value is 0.0493, which is much closer to the exact 
p value of 0.0523 than the asymptotic result. But a more important benefit of the Monte 
Carlo analysis is that we also obtain a 99% confidence interval. In this example, with a 
Monte Carlo sample of 10,000, the interval is (0.0437, 0.0549). This interval could be 
narrowed by sampling more multinomial vectors from the null distribution. To obtain 
more conclusive evidence that the exact p value exceeds 0.05 and thus is not statistically 
significant at the 5% level, 100,000 multinomial vectors can be sampled from the null 
distribution. The results are shown in Figure 3.3. 


Figure 3.3 Monte Carlo results for chi-square test with 100,000 samples


Test Statistics

                                                Monte Carlo Sig.
                                                         99% Confidence Interval
            Chi-Square    df    Asymp. Sig.     Sig.     Lower Bound    Upper Bound
CATEGORY    8.000         3     .046            .051     .049           .053

1. 4 cells (100.0%) have expected frequencies less than 5. The minimum expected cell
frequency is 1.0.

2. Based on 100000 sampled tables with starting seed 2000000.


This time, the Monte Carlo estimate is 0.0508, almost indistinguishable from the exact
result. Moreover, with 99% confidence, the exact p value lies within the interval
(0.0490, 0.0525). Because the lower bound still falls just below 0.05, the interval makes
it very likely, though not certain at the 99% level, that the exact p value exceeds 0.05.


Example: A Medium-Sized Data Set 


This example shows that the chi-square approximation may not be reliable even when
the sample size is as large as 50, there are only three categories, and the cell counts
satisfy the Siegel and Castellan criteria discussed on p. 42. Table 3.4 displays data from Radlow
and Alt (1975) showing observed counts and multinomial probabilities under the null 
hypothesis for a multinomial distribution with three categories. 


Table 3.4 Frequency counts from a multinomial distribution with three categories


                      Multinomial Categories     Row
                      1       2       3          Total
Cell Counts           12      7       31         50
Cell Probabilities    0.2     0.3     0.5        1


Figure 3.4 shows the results of the chi-square goodness-of-fit test on these data.

Figure 3.4 Chi-square goodness-of-fit results for medium-sized data set


Multinomial Categories

         Observed N    Expected N
1        12            10.0
2        7             15.0
3        31            25.0
Total    50


Test Statistics
                     Multinomial
                     Categories
Chi-Square           6.107
df                   2
Asymp. Sig.          .047
Exact Sig.           .051
Point Probability    .002

1. 0 cells (.0%) have expected frequencies less than
5. The minimum expected cell frequency is 10.0.


Notice that the asymptotic approximation gives a p value of 0.0472, while the exact p
value is 0.0507. Thus, at the 5% significance level, the asymptotic value erroneously
leads to rejection of the null hypothesis, despite the reasonably large sample size, the
small number of categories, and the fact that $E_i \ge 10$ for i = 1, 2, 3.
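
The two p values for this example can be compared with a short calculation. With 2
degrees of freedom the chi-square upper tail has the closed form
$\Pr(\chi^2_2 \ge x) = e^{-x/2}$, so the asymptotic value needs no statistical library.
The exact value is obtained here by enumerating all 1 x 3 tables with N = 50; this direct
enumeration and the tolerance of 1e-9 for ties are illustrative assumptions, not a
description of how Exact Tests computes the result.

```python
from math import comb, exp

observed = (12, 7, 31)
probs = (0.2, 0.3, 0.5)
n = sum(observed)

def chi_square(counts):
    """Equation 3.3 with expected counts n * pi_i."""
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(counts, probs))

x2 = chi_square(observed)
asymptotic = exp(-x2 / 2.0)        # chi-square upper tail, valid for df = 2 only

exact = 0.0
for a in range(n + 1):
    for b in range(n - a + 1):
        c = n - a - b
        if chi_square((a, b, c)) >= x2 - 1e-9:
            # multinomial probability of the table (a, b, c)
            exact += (comb(n, a) * comb(n - a, b)
                      * probs[0] ** a * probs[1] ** b * probs[2] ** c)

print(round(x2, 3), round(asymptotic, 4), round(exact, 4))
```

The printed asymptotic value falls below 0.05 while the enumerated value should fall just
above it, which is the discrepancy described in the text.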


One-Sample Kolmogorov Goodness-of-Fit Test 


The one-sample Kolmogorov test is used to determine if it is reasonable to model a data
set as consisting of independent identically distributed (i.i.d.) observations from a
completely specified distribution. Exact Tests offers this test for the normal, uniform,
and Poisson distributions.

The data consist of N i.i.d. observations, $(u_1, u_2, \ldots, u_N)$, from an unknown
distribution G(u); i.e., $G(u) = \Pr(U \le u)$. Let F(u) be a completely specified
distribution. The Kolmogorov test is used to test the null hypothesis


$H_0: G(u) = F(u)$ for all u    Equation 3.10


$H_0$ can be tested against either a two-sided alternative or a one-sided alternative. The
two-sided alternative is


$H_1: G(u) \ne F(u)$ for at least one value of u    Equation 3.11


Two one-sided alternative hypotheses can be specified. One states that F is stochastically
greater than G. That is,


$H_{1a}: G(u) < F(u)$ for at least one value of u    Equation 3.12


The other one-sided alternative states the complement, that G is stochastically greater
than F. That is,


$H_{1b}: F(u) < G(u)$ for at least one value of u    Equation 3.13


The test statistics for testing $H_0$ against either $H_1$, $H_{1a}$, or $H_{1b}$ are all
functions of the specified distribution, F(u), and the empirical cumulative distribution
function (c.d.f.), S(u), which is derived from the observed values,
$(u_1, u_2, \ldots, u_N)$. The test statistic for testing $H_0$ against $H_1$ is


$T = \sup_u |F(u) - S(u)|$    Equation 3.14

The test statistic for testing $H_0$ against $H_{1a}$ is

$T^+ = \sup_u \{F(u) - S(u)\}$    Equation 3.15

The test statistic for testing $H_0$ against $H_{1b}$ is

$T^- = \sup_u \{S(u) - F(u)\}$    Equation 3.16


Kolmogorov derived asymptotic distributions, as $N \to \infty$, for $T$, $T^+$, and $T^-$.
For small N, the exact p values provided by Exact Tests are appropriate. If F(u) is a
discrete distribution, the exact p values can be computed using the method described by
Conover (1980). If F(u) is a continuous distribution, the exact p value can be computed
using the results given by Durbin (1973).
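
The three statistics in Equations 3.14 through 3.16 can be computed directly from the
sorted sample when F is continuous, because the empirical c.d.f. is a step function that
jumps at each observation. The sketch below shows this for a hypothetical three-point
sample and a uniform F on (0, 1). The function names and the data are illustrative only,
and the discrete case (for example, a Poisson F) is not handled.

```python
def ks_statistics(sample, cdf):
    """Return (T, T_plus, T_minus) for a continuous hypothesized c.d.f."""
    u = sorted(sample)
    n = len(u)
    # S(u) jumps from i/n to (i + 1)/n at the observation with 0-based index i,
    # so each supremum is reached at, or just before, an observed point.
    t_plus = max(cdf(x) - i / n for i, x in enumerate(u))          # sup {F - S}
    t_minus = max((i + 1) / n - cdf(x) for i, x in enumerate(u))   # sup {S - F}
    return max(t_plus, t_minus), t_plus, t_minus

def uniform_cdf(x):
    """C.d.f. of the uniform distribution with limits 0 and 1."""
    return min(1.0, max(0.0, x))

t, t_plus, t_minus = ks_statistics([0.7, 0.1, 0.4], uniform_cdf)   # hypothetical data
```

For this sample the largest gap is S - F just after the last observation (1 - 0.7), so
T equals T^-.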
Example: Testing for a Uniform Distribution

This example is taken from Conover (1980). A random sample of size 10 is drawn from
a continuous distribution. The sample can be tested to determine if it came from a uniform
continuous distribution with limits of 0 and 1. Figure 3.5 shows the data displayed in the
Data Editor.


Figure 3.5 Data to test for a uniform distribution 


[Data Editor view: one row for each of the 10 observations, with a value column]


We can run the Kolmogorov-Smirnov test to determine if the sample was generated by 
a uniform distribution. The results are displayed in Figure 3.6. 


Figure 3.6 Kolmogorov-Smirnov results


One-Sample Kolmogorov-Smirnov Test

                                          VALUE
N                                         10
Uniform Parameters          Minimum       0
                            Maximum       1
Most Extreme Differences    Absolute      .289
                            Positive      .289
                            Negative      -.229
Kolmogorov-Smirnov Z                      .914
Asymp. Sig. (2-tailed)                    .374
Exact Significance (2-tailed)             .311
Point Probability                         .000

1. Test distribution is Uniform.
2. User-Specified


The exact two-sided p value is 0.311. The asymptotic two-sided p value is 0.3738.


One-Sample Inference for 
Binary Data 


This chapter discusses two statistical procedures for analyzing binary data in Exact 
Tests. First, it describes exact hypothesis testing and exact confidence interval 
estimation for a binomial probability. Next, it describes the runs test (also known as 
the Wald-Wolfowitz one-sample runs test) for determining if a sequence of binary 
observations is random. You will see that although the theory underlying the runs test 
is based on a binary sequence, the test itself is applied more generally to non-binary 
observations. For this reason, the data are transformed automatically in Exact Tests 
from a non-binary to a binary sequence prior to executing the test. 


Available Tests 


Table 4.1 shows the tests for binary data available in Exact Tests, the procedure from 
which each can be obtained, and a bibliographical reference for each. 


Table 4.1 Available tests 


Test             Procedure                             Reference
Binomial test    Nonparametric Tests: Binomial Test    Conover (1971)
Runs test        Nonparametric Tests: Runs Test        Lehmann (1975)


Binomial Test and Confidence Interval 


The data consist of t successes and N - t failures in N independent Bernoulli trials.
Let $\pi$ be the true underlying success rate. Then the outcome T = t has the binomial
probability


$\Pr(T = t \mid \pi) = \binom{N}{t} \pi^t (1 - \pi)^{N - t}$    Equation 4.1

Exact Tests computes the observed proportion $\hat{\pi}$, which is also the
maximum-likelihood estimate of $\pi$, as


$\hat{\pi} = t/N$

To test the null hypothesis


$H_0: \pi = \pi_0$    Equation 4.2


Exact Tests computes the following one- and two-sided p values:


$p_1 = \min\{\Pr(T \le t \mid \pi_0), \Pr(T \ge t \mid \pi_0)\}$    Equation 4.3


and


$p_2 = 2 p_1$    Equation 4.4


Example: Pilot Study for a New Drug


Twenty patients were treated in a pilot study of a new drug. There were four responders
(successes) and 16 non-responders (failures). The binomial test can be run to test the
null hypothesis that $\pi = 0.05$.


These data can be entered into the Data Editor using a response variable with 20 cases.
If successes are coded as 1's, and failures are coded as 0's, response contains sixteen
cases with a value of 0, and four cases with a value of 1.


The binomial test performed on these data produces the results displayed in Figure 4.1. 


Figure 4.1 Binomial test results for drug study


                                 Exact Sig.    Point
                    Category     (1-tailed)    Probability
Response to Drug    Success      .016          .013
                    Failure


The exact one-sided p value is 0.0159, so the null hypothesis that $\pi = 0.05$ is
rejected at the 5% significance level.
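
The one-sided p value and the point probability for the drug study follow directly from
the binomial probabilities in Equation 4.1. The sketch below evaluates them with the
Python standard library; the function name binomial_test is hypothetical and the code is
an illustration of the formulas rather than the Exact Tests implementation.

```python
from math import comb

def binomial_test(t, n, pi0):
    """Return (p1, p2, point probability) for t successes in n trials under pi0."""
    pmf = [comb(n, k) * pi0 ** k * (1 - pi0) ** (n - k) for k in range(n + 1)]
    lower = sum(pmf[: t + 1])      # Pr(T <= t | pi0)
    upper = sum(pmf[t:])           # Pr(T >= t | pi0)
    p1 = min(lower, upper)         # one-sided p value
    p2 = 2 * p1                    # two-sided p value
    return p1, p2, pmf[t]

p1, p2, point = binomial_test(4, 20, 0.05)
print(round(p1, 4), round(point, 4))
```

Here the upper tail Pr(T >= 4) is the smaller of the two tails, so it determines the
one-sided p value.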
Runs Test


Consider a sequence of N binary outcomes, $(y_1, y_2, \ldots, y_N)$, where each $y_i$ is
either a 0 or a 1. A run is defined as a succession of identical numbers that are followed
and preceded by a different number, or no number at all. For example, the sequence


(1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1)


begins with a run of two 1's. A run of three 0's follows, and next a run of one 1. Then
comes a run of four 0's, followed by a run of two 1's which in turn is followed by a run
of one 0. Finally, there is a run of one 1. In all, there are seven runs in the above
sequence. Let the random variable R denote the number of runs in a binary sequence
consisting of m 1's and n 0's, where m + n = N. The Wald-Wolfowitz runs test is used
to test the null hypothesis


$H_0$: The sequence of m 1's and n 0's, (m + n) = N, was generated by N independent
Bernoulli trials, each with a probability $\pi$ of generating a 1 and a probability
$(1 - \pi)$ of generating a 0.


Very large or very small values of R are evidence against $H_0$. In order to determine
what constitutes a very large or a very small number of runs, the distribution of R is
needed. Although unconditionally the distribution of R depends on $\pi$, this nuisance
parameter can be eliminated by working with the conditional distribution of R, given that
there are m 1's and n 0's in the sequence. This conditional distribution can be shown to be


$\Pr(R = 2k) = \frac{2 \binom{m-1}{k-1} \binom{n-1}{k-1}}{\binom{N}{n}}$    Equation 4.5

and


$\Pr(R = 2k+1) = \frac{\binom{m-1}{k-1} \binom{n-1}{k} + \binom{m-1}{k} \binom{n-1}{k-1}}{\binom{N}{n}}$    Equation 4.6

Suppose that r is the observed value of the random variable R. The two-sided exact p
value is defined as


$p_2 = \Pr(|R - E(R)| \ge |r - E(R)|)$    Equation 4.7


where E(R) is the expected value of R.
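
The conditional distribution of R and the two-sided p value can be evaluated exactly with
rational arithmetic. The sketch below assumes the standard form of the run-length
probabilities, with numerators built from binomial coefficients of m - 1 and n - 1 and
denominator $\binom{N}{n}$; the function names are hypothetical, and m and n must both be
at least 1.

```python
from fractions import Fraction
from math import comb

def runs_distribution(m, n):
    """Conditional distribution of R given m ones and n zeros."""
    total = comb(m + n, n)
    dist = {}
    for k in range(1, min(m, n) + 1):
        even = 2 * comb(m - 1, k - 1) * comb(n - 1, k - 1)
        odd = (comb(m - 1, k - 1) * comb(n - 1, k)
               + comb(m - 1, k) * comb(n - 1, k - 1))
        dist[2 * k] = Fraction(even, total)            # Pr(R = 2k)
        if odd:
            dist[2 * k + 1] = Fraction(odd, total)     # Pr(R = 2k + 1)
    return dist

def exact_runs_p_value(r, m, n):
    """Two-sided exact p value: Pr(|R - E(R)| >= |r - E(R)|)."""
    dist = runs_distribution(m, n)
    mean = sum(value * prob for value, prob in dist.items())
    return sum(prob for value, prob in dist.items()
               if abs(value - mean) >= abs(r - mean))

# Hypothetical sequence with six 1's, four 0's, and 3 observed runs.
print(exact_runs_p_value(3, 6, 4))    # 1/14, about 0.071
```

Using Fraction keeps the comparison in Equation 4.7 exact, so values of R that are
exactly as extreme as r are never lost to rounding.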


If a data set is too large for the computation shown in Equation 4.7 to be feasible, these
p values can be estimated very accurately using Monte Carlo sampling.

For large data sets, asymptotic normality can be invoked. Let r denote the observed
value of the random variable R, h = 0.5 if r < (2mn/N) + 1, and h = -0.5 if
r > (2mn/N) + 1. Then the statistic


$z = \frac{r + h - (2mn/N) - 1}{\sqrt{[2mn(2mn - N)]/[N^2(N - 1)]}}$    Equation 4.8

is asymptotically normally distributed with a mean of 0 and a variance of 1.
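
The asymptotic statistic in Equation 4.8 is simple to compute by hand or in code. In the
sketch below the variance of R is taken to be $2mn(2mn - N)/[N^2(N - 1)]$, the two-sided
p value is obtained from the standard normal tail via math.erfc, and h is set to 0 when r
equals its expected value, a case the text does not cover. The function name runs_z is
hypothetical.

```python
from math import erfc, sqrt

def runs_z(r, m, n):
    """Return (z, two-sided asymptotic p value) for r runs among m 1's and n 0's."""
    big_n = m + n
    mean = 2.0 * m * n / big_n + 1.0
    if r < mean:
        h = 0.5                    # continuity correction toward the mean
    elif r > mean:
        h = -0.5
    else:
        h = 0.0                    # assumption: no correction when r equals the mean
    sd = sqrt(2.0 * m * n * (2.0 * m * n - big_n) / (big_n ** 2 * (big_n - 1)))
    z = (r + h - mean) / sd
    p_two_sided = erfc(abs(z) / sqrt(2.0))   # equals 2 * (1 - Phi(|z|))
    return z, p_two_sided

# Hypothetical sequence with six 1's, four 0's, and 3 runs.
z, p = runs_z(3, 6, 4)
print(round(z, 3), round(p, 3))    # -1.616 0.106
```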


The above exact, Monte Carlo, and asymptotic results apply only to binary data. However,
you might want to test for the randomness of any general data series
$x_1, x_2, \ldots, x_N$, where the $x_i$'s are not binary. In that case, the approach
suggested by Lehmann (1975) is to replace each $x_i$ with a corresponding binary
transformation


$y_i = \begin{cases} 1 & \text{if } x_i \ge \tilde{x} \\ 0 & \text{if } x_i < \tilde{x} \end{cases}$    Equation 4.9


where $\tilde{x}$ is the median of the observed data series. The median is calculated in
the following way. Let $x_{[1]} \le x_{[2]} \le \ldots \le x_{[N]}$ be the observed data
series sorted in ascending order. Then


$\tilde{x} = \begin{cases} x_{[(N+1)/2]} & \text{if N is odd} \\ (x_{[N/2]} + x_{[(N+2)/2]})/2 & \text{if N is even} \end{cases}$    Equation 4.10


Once this binary transformation has been made, the runs test can be applied to the binary 
data, as illustrated in the following data set. In addition to the median, the mean, mode, 
or any specified value can be selected as the cut-off for the runs test. 
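
The transformation and the run count can be sketched as follows. The function names are
hypothetical, the five-point series is made up for illustration, and the cut-off defaults
to the median of Equation 4.10 but can be replaced by any specified value.

```python
def median(values):
    """Median as in Equation 4.10 (indices there are 1-based)."""
    s = sorted(values)
    n = len(s)
    if n % 2 == 1:
        return s[(n + 1) // 2 - 1]
    return (s[n // 2 - 1] + s[(n + 2) // 2 - 1]) / 2

def binarize(values, cutoff=None):
    """Equation 4.9: 1 if the value is at or above the cut-off, else 0."""
    cut = median(values) if cutoff is None else cutoff
    return [1 if x >= cut else 0 for x in values]

def count_runs(seq):
    """Number of runs: a new run starts wherever the value changes."""
    return sum(1 for i, y in enumerate(seq) if i == 0 or y != seq[i - 1])

series = [5, 1, 4, 2, 3]                 # hypothetical data, median 3
binary = binarize(series)                # [1, 0, 1, 0, 1]
print(binary, count_runs(binary))        # 5 runs
```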
Example: Children’s Aggression Scores 


Figure 4.2 displays in the Data Editor the aggression scores for 24 children from a 
study of the dynamics of aggression in young children. These data appear in Siegel 
and Castellan (1988). 


Figure 4.2 Aggression scores in order of occurrence 


Figure 4.3 shows the results of the runs test for these data. 


Figure 4.3 Runs test results for aggression scores data


         Test     Cases <       Number     Asymp. Sig.   Exact Significance   Point
         Value    Test Value    of Runs    (2-tailed)    (2-tailed)           Probability
SCORE    25.00    12            10         .297          .301                 .081

1. Median


To obtain these results, Exact Tests uses the median of the 24 observed scores (25.0) as
the cut-off for transforming the data into a binary sequence in accordance with Equation
4.9. This yields the binary sequence


(1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0).


Notice that this binary sequence of 12 1's and 12 0's contains 10 runs. Exact Tests
determines that all permutations of the 12 1's and 12 0's would yield anywhere between
a minimum of 2 runs and a maximum of 24 runs. The exact two-sided p value, that is, the
probability of a number of runs at least as far from its expected value of 13 as the
observed 10 (10 or fewer runs, or 16 or more), is 0.301 and does not indicate any
significant departure from randomness.

If the data set had been larger, it would have been difficult to compute the exact test,
and you would have had to either rely on the asymptotic results or estimate the exact p
values using the Monte Carlo option. Figure 4.4 shows Monte Carlo estimates of the
exact p values for the runs test based on 10,000 random permutations of the 12 0's and
12 1's in a binary sequence of 24 numbers. Each permutation is equally likely, with
probability 12!12!/24! = 1/2704156.


Figure 4.4 Monte Carlo results for runs test for aggression scores data


                                                                      Monte Carlo Sig. (2-tailed)
                                                                              99% Confidence Interval
         Test     Cases <       Cases >=      Total    Number     Asymp. Sig.          Lower    Upper
         Value    Test Value    Test Value    Cases    of Runs    (2-tailed)    Sig.   Bound    Bound
SCORE    25.00    12            12            24       10         .297          .298   .286     .310

1. Median
2. Based on 10000 sampled tables with starting seed 200000.


Notice that the Monte Carlo two-sided p value, 0.298, is extremely close to the exact p
value, 0.301. But more importantly, the Monte Carlo method produces an interval that
contains the exact two-sided p value with 99% confidence. In this example, the interval
is (0.286, 0.310), which again indicates that the null hypothesis of a random data series
cannot be rejected.


Example: Small Data Set 


Here is a small hypothetical data set illustrating the difference between the exact and
asymptotic inference for the runs test. The data consist of a binary sequence of ten
observations


(1, 1, 1, 1, 0, 0, 0, 0, 1, 1)


with six 1's and four 0's. There are 3 runs in this sequence. The results of the runs
test are displayed in Figure 4.5.

Figure 4.5 Runs test results for small data set


         Test     Cases <       Cases >=      Total    Number               Asymp. Sig.   Exact Significance   Point
         Value    Test Value    Test Value    Cases    of Runs    Z         (2-tailed)    (2-tailed)           Probability
SCORE    1.00     4             6             10       3          -1.616    .106          .071                 .038

1. Median


Notice that the asymptotic two-sided p value is 0.106, while the exact two-sided p value
is 0.071.


Two-Sample Inference: 
Paired Samples 


The tests in this section are commonly applied to matched pairs of data, such as when 
several individuals are being studied and two repeated measurements are taken on each 
individual. The objective is to test the null hypothesis that both measurements came 
from the same population. The inference is complicated by the fact that the two obser- 
vations on the same individual are correlated, while there is independence across the 
different individuals being studied. In this setting, Exact Tests provides statistical pro- 
cedures for both continuous and categorical data. For matched pairs of continuous data 
(possibly with ties) Exact Tests provides the sign test and the Wilcoxon signed-ranks 
test. For matched pairs of binary outcomes, Exact Tests provides the McNemar test. For 
matched pairs of ordered categorical outcomes, Exact Tests generalizes from the Mc- 
Nemar test to the marginal homogeneity test. 


Available Tests 


Table 5.1 shows the available tests for paired samples, the procedure from which they 
can be obtained, and a bibliographical reference for each test. 


Table 5.1 Available tests


Test                          Procedure                                         Reference

Sign test                     Nonparametric Tests: Two-Related-Samples Tests    Sprent (1993)
Wilcoxon signed-ranks test    Nonparametric Tests: Two-Related-Samples Tests    Sprent (1993)
McNemar test                  Nonparametric Tests: Two-Related-Samples Tests    Siegel and Castellan (1988)
Marginal homogeneity test     Nonparametric Tests: Two-Related-Samples Tests    Agresti (1990)


© Copyright IBM Corporation. 1989, 2013


When to Use Each Test


The tests in this chapter have the common feature that they are applicable to data sets 
consisting of pairs of correlated data. The goal is to test if the first member of the pair 
has a different probability distribution from the second member. The choice of test is 
primarily determined by the type of data being tested: continuous, binary, or categorical. 


Sign test. This test is used when observations in the form of paired responses arise from
continuous distributions (possibly with ties), but the actual data are not available to
you. Instead, all that is provided is the sign (positive or negative) of the difference
in responses of the two members of each pair.


Wilcoxon signed-ranks test. This test is also used when observations in the form of paired
responses arise from continuous distributions (possibly with ties). However, in addition
to the sign of each difference, you now also have its rank in the full sample of response
differences. If this additional information is available, the Wilcoxon signed-ranks test
is more powerful than the sign test.


McNemar test. This test is used to test the equality of binary response rates from two 
populations in which the data consist of paired, dependent responses, one from each 
population. It is typically used in a repeated measures situation, in which each subject’s 
response is elicited twice, once before and once after a specified event (treatment) occurs. 
The test then determines if the initial response rate (before the event) equals the final 
response rate (after the event). 


Marginal homogeneity test. This test generalizes the McNemar test from binary response
to multinomial response. Specifically, it tests the equality of two c x 1 multinomial
response vectors. Technically, the response could be ordered or unordered. However,
the methods developed in the present release of Exact Tests apply only to ordered
response. The data consist of paired, dependent responses, one from population 1 and
the other from population 2. Each response falls into one of c ordered categories. The
data are arranged in the form of a square c x c contingency table in which an entry in
cell (i, j) signifies that the response of one member of the dependent pair fell into
category i, while the response of the second member fell into category j. A typical
application of the test of marginal homogeneity is a repeated measures situation in
which each subject's ordered categorical response is elicited twice, once before and
once after a specified event (treatment) occurs. The test then determines if the response
rates in the c ordered categories are altered by the treatment. See Agresti (1990) for
various model-based approaches to this problem. Exact Tests provides a nonparametric
solution using the generalized Mantel-Haenszel approach suggested by Kuritz, Landis,
and Koch (1988). See also White, Landis, and Cooper (1982).

Statistical Methods


For all the tests in this chapter, the data consist of correlated pairs of observations. For 
some tests, the observations are continuous (possibly with ties), while for others the 
observations are categorical. Nevertheless, in all cases, the goal is to test the null 
hypothesis that the two populations generating each pair of observations are identical. 
The basic permutation argument for testing this hypothesis is the same for all the tests. 
By this argument, if the null hypothesis were true, the first and second members of each 
pair of observations could just as well have arisen in the reverse order. Thus, each pair 
can be permuted in two ways, and if there are N pairs of observations, there are $2^N$
equally likely ways to permute the data. By actually carrying out these permutations, 
you can obtain the exact distribution of any test statistic defined on the data. 


Sign Test and Wilcoxon Signed-Ranks Test 


The data consist of N paired observations $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$,
where the X and Y random variables are correlated, usually through a matched-pairs design.
Define the N differences


$d_i = x_i - y_i$, i = 1, 2, ..., N

Omit from further consideration all pairs with a zero difference. Assume that for all
i, $|d_i| > 0$. The following assumptions are made about the distribution of the random
variables $D_i$:


1. The distribution of each $D_i$ is symmetric.
2. The $D_i$'s are mutually independent.
3. The $D_i$'s have the same median.


Let the common median of the N $D_i$'s be denoted by $\lambda$. The null hypothesis is

$H_0: \lambda = 0$

There are two one-sided alternative hypotheses of the form

$H_1: \lambda > 0$

and

$H_1': \lambda < 0$


The two-sided alternative hypothesis is that either $H_1$ or $H_1'$ holds, but you cannot
specify which.

To test these hypotheses, utilize permutational distributions of test statistics derived
from either the signs or the signed ranks of the N differences. Let the absolute values of
the observed paired differences, arranged in ascending order, be


$|d_{[1]}| \le |d_{[2]}| \le \ldots \le |d_{[N]}|$


and let


$r_{[1]} \le r_{[2]} \le \ldots \le r_{[N]}$


be corresponding ranks (mid-ranks in the case of tied data). Specifically, if there are
$m_j$ observations tied at the jth smallest absolute value, you assign to all of them the
rank

$r_{[j]} = m_1 + \ldots + m_{j-1} + (m_j + 1)/2$    Equation 5.1


For the Wilcoxon signed-ranks test, inference is based on the permutational distribution 
of the test statistic 

$$T_{SR} = \min\left( \sum_{i=1}^{N} r_i I(D_i > 0),\ \sum_{i=1}^{N} r_i I(D_i < 0) \right)$$ Equation 5.2 

whose observed value is 

$$t_{SR} = \min\left( \sum_{i=1}^{N} r_i I(d_i > 0),\ \sum_{i=1}^{N} r_i I(d_i < 0) \right)$$ Equation 5.3 

where $I(\cdot)$ is the indicator function. It assumes a value of 1 if its argument is true and 0 
otherwise. In other words, $t_{SR}$ is the minimum of the sum of ranks of the positive differences 
and the sum of ranks of the negative differences among the N observed differences. 
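
As a minimal sketch of Equations 5.1 to 5.3, the following Python fragment assigns mid-ranks to the absolute differences and returns the smaller of the two signed rank sums. The function names and the sample differences are illustrative assumptions; they are not part of Exact Tests.

```python
def mid_ranks(values):
    """Return the mid-ranks (Equation 5.1) of values, in the original order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Sorted positions start+1 .. end+1 share their average rank.
        shared = (start + 1 + end + 1) / 2
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def signed_rank_statistic(diffs):
    """Observed Wilcoxon signed-ranks statistic t_SR (Equation 5.3)."""
    nonzero = [d for d in diffs if d != 0]      # zero differences are omitted
    ranks = mid_ranks([abs(d) for d in nonzero])
    pos = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    neg = sum(r for r, d in zip(ranks, nonzero) if d < 0)
    return min(pos, neg), ranks


# Hypothetical differences: one zero, and one tie in absolute value (3 and -3).
t_sr, ranks = signed_rank_statistic([3, -1, 0, 4, -3, 7])
```

In this hypothetical data set the two differences with absolute value 3 are tied at the second smallest position, so each receives the mid-rank 1 + (2 + 1)/2 = 2.5.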

Sometimes you do not know the actual magnitude of the difference but only have its 
sign available. In that case, you cannot rank the differences and so cannot compute the 
Wilcoxon signed-ranks statistic. However, you can still use the information present in 
the sign of the difference and perform the sign test. For the sign test, inference is based 
on the permutational distribution of the test statistic 

$$T_S = \min\left( \sum_{i=1}^{N} I(D_i > 0),\ \sum_{i=1}^{N} I(D_i < 0) \right)$$ Equation 5.4 

whose observed value is 

$$t_S = \min\left( \sum_{i=1}^{N} I(d_i > 0),\ \sum_{i=1}^{N} I(d_i < 0) \right)$$ Equation 5.5 

In other words, $t_S$ is the smaller of the number of positive differences and the number 
of negative differences among the N differences. 

The permutational distributions of $T_{SR}$ and $T_S$ under the null hypothesis are 
obtained by assigning positive or negative signs to the N differences in all possible ways. 
There are $2^N$ such possible assignments, corresponding to the reference set 

$$\Gamma = \{ (\mathrm{sgn}(D_1), \mathrm{sgn}(D_2), \ldots, \mathrm{sgn}(D_N)) : \mathrm{sgn}(D_i) = 1 \text{ or } -1, \text{ for } i = 1, 2, \ldots, N \}$$ Equation 5.6 

and each assignment has equal probability, $2^{-N}$, under the null hypothesis. Exact Tests 
uses network algorithms to enumerate the reference set in Equation 5.6 in order to 
compute exact p values. 
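
The reference set can be illustrated by brute force for very small N. The sketch below is not the network algorithm used by Exact Tests; it simply visits all sign vectors and tabulates the sum of the positively signed ranks. The function name and the three ranks are hypothetical.

```python
from fractions import Fraction
from itertools import product


def exact_signed_rank_distribution(ranks):
    """Map each value of the positive-rank sum to its exact null probability."""
    n = len(ranks)
    counts = {}
    # Each member of the reference set is one vector of N signs (+1 or -1).
    for signs in product((1, -1), repeat=n):
        t_plus = sum(r for r, s in zip(ranks, signs) if s == 1)
        counts[t_plus] = counts.get(t_plus, 0) + 1
    total = 2 ** n                      # every assignment has probability 2**-N
    return {t: Fraction(c, total) for t, c in counts.items()}


dist = exact_signed_rank_distribution([1, 2, 3])    # hypothetical ranks
left_tail = sum(p for t, p in dist.items() if t <= 1)
```

Because the number of sign vectors doubles with every additional pair, this direct enumeration is only practical for small data sets.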
From Equation 5.2 and standard binomial theory, the mean of $T_{SR}$ is 

$$E(T_{SR}) = \sum_{i=1}^{N} r_i / 2$$ Equation 5.7 

and the variance of $T_{SR}$ is 

$$\sigma^2(T_{SR}) = \sum_{i=1}^{N} r_i^2 / 4$$ Equation 5.8 

From Equation 5.4 and standard binomial theory, the mean of $T_S$ is 

$$E(T_S) = N/2$$ Equation 5.9 

and the variance of $T_S$ is 

$$\sigma^2(T_S) = N/4$$ Equation 5.10 
For notational convenience, you can drop the subscript and let $T$ denote either the 
statistic for the sign test or the statistic for the Wilcoxon signed-ranks test. The p value 
computations that follow are identical for both tests, with the understanding that $T$ 
denotes $T_{SR}$ when the Wilcoxon signed-ranks test is being computed and denotes $T_S$ 
when the sign test is being computed. In either case, you can now denote the 
standardized test statistic as 

$$Z = \frac{T - E(T)}{\sigma(T)}$$ Equation 5.11 

The two-sided asymptotic p value is defined, by the symmetry of the normal 
distribution, to be double the one-sided p value: 

$$\tilde{p}_2 = 2\tilde{p}_1$$ Equation 5.12 


The exact one-sided p value is defined as 

$$p_1 = \begin{cases} \Pr(T \ge t) & \text{if } t > E(T) \\ \Pr(T \le t) & \text{if } t \le E(T) \end{cases}$$ Equation 5.13 

where $t$ is the observed value of $T$. The potential to misinterpret a one-sided p value 
applies in the exact setting, as well as in the asymptotic case. The exact two-sided p 
value is defined to be double the exact one-sided p value: 

$$p_2 = 2p_1$$ Equation 5.14 

This is a reasonable definition, since the exact permutational distribution of $T$ is 
symmetric about its mean. 

The one-sided Monte Carlo p value is obtained as follows. First, suppose that 
$t > E(T)$, so that you are estimating the right tail of the exact distribution. You sample 
M times from the reference set ($\Gamma$) of $2^N$ possible assignments of signs to the ranked 
data. Suppose that the ith sample generates a value $t_i$ for the test statistic. Define the 
random variable 

$$Z_i = \begin{cases} 1 & \text{if } t_i \ge t \\ 0 & \text{otherwise} \end{cases}$$

An unbiased Monte Carlo point estimate of the one-sided p value is 

$$\hat{p}_1 = \sum_{i=1}^{M} Z_i / M$$ Equation 5.15 

Next, if $t \le E(T)$, so that you are estimating the left tail of the exact distribution, the random 
variable is defined by 

$$Z_i = \begin{cases} 1 & \text{if } t_i \le t \\ 0 & \text{otherwise} \end{cases}$$

The Monte Carlo point estimate of the one-sided p value is once again given by 
Equation 5.15. 
A 99% confidence interval for the exact one-sided p value is 

$$CI = \hat{p}_1 \pm 2.576 \sqrt{\hat{p}_1 (1 - \hat{p}_1)/M}$$ Equation 5.16 


The constant in the above equation, 2.576, is the upper 0.005 quantile of the standard 
normal distribution. It arises because Exact Tests chooses a 99% confidence interval for 
the p value as its default. However, you can easily choose any confidence level for the 
Monte Carlo estimate of the p value. Ordinarily, you would not want to lower the level 
of the Monte Carlo confidence interval to below the 99% default, since there should be 
a high assurance that the exact p value is contained in the confidence interval. 
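
The sampling scheme of Equations 5.15 and 5.16 can be sketched as follows. This is an illustration under stated assumptions, not the Exact Tests implementation: the statistic is taken to be the sum of the positively signed ranks, the left tail is estimated, the random number generator is Python's own (so a given seed will not reproduce SPSS output), and the function name and input ranks are hypothetical.

```python
import math
import random


def monte_carlo_left_tail(ranks, t_obs, m=10000, seed=2000000, z=2.576):
    """Estimate Pr(T <= t_obs) by sampling sign vectors (Equations 5.15, 5.16)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(m):
        # Draw one member of the reference set: each sign is + with probability 1/2.
        t_i = sum(r for r in ranks if rng.random() < 0.5)
        if t_i <= t_obs:                # Z_i = 1 for the left tail
            hits += 1
    p_hat = hits / m
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / m)
    return p_hat, (p_hat - half_width, p_hat + half_width)


# Hypothetical input: 16 untied ranks and an observed rank sum of 12.
p_hat, (lower, upper) = monte_carlo_left_tail(list(range(1, 17)), 12)
```

Replacing `z` with another standard normal quantile changes the confidence level; 2.576 corresponds to the 99% default described above.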

A technical difficulty arises when either $\hat{p}_1 = 0$ or $\hat{p}_1 = 1$. Now the sample standard 
deviation is 0, but the data do not support a confidence interval of zero width. An 
alternative approach in this extreme situation is to invert an exact binomial hypothesis test. 
It can be easily shown that if $\hat{p}_1 = 0$, an $\alpha$% confidence interval for the exact p value is 

$$CI = [0,\ 1 - (1 - \alpha/100)^{1/M}]$$ Equation 5.17 

Similarly, when $\hat{p}_1 = 1$, an $\alpha$% confidence interval for the exact p value is 

$$CI = [(1 - \alpha/100)^{1/M},\ 1]$$ Equation 5.18 
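
For a sense of scale, the two boundary intervals can be evaluated directly. The helper below is a hypothetical illustration of Equations 5.17 and 5.18, with the confidence level given in percent.

```python
def zero_or_one_interval(p_hat, m, level=99.0):
    """Confidence interval for the exact p value when the estimate is 0 or 1."""
    root = (1 - level / 100) ** (1 / m)
    if p_hat == 0:
        return (0.0, 1 - root)          # Equation 5.17
    if p_hat == 1:
        return (root, 1.0)              # Equation 5.18
    raise ValueError("use Equation 5.16 when 0 < p_hat < 1")


# No sampled statistic was as extreme as the observed one in 10,000 samples.
low, high = zero_or_one_interval(0, 10000)
```

With 10,000 samples and a 99% level, the upper bound is roughly 0.00046, so an estimate of zero still leaves a small but nonzero range for the exact p value.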


By symmetry, the two-sided Monte Carlo p value is twice the one-sided p value: 

$$\hat{p}_2 = 2\hat{p}_1$$ Equation 5.19 
You can show that the variance of the two-sided Monte Carlo p value is four times as 
large as the variance of the corresponding one-sided Monte Carlo p value. The 
confidence interval for the true two-sided p value can thus be adjusted appropriately, 
based on the increased variance. 


Example: AZT for AIDS 


The data shown in Figure 5.1, from Makutch and Parks (1988), document the response 
of serum antigen level to AZT in 20 AIDS patients. Two sets of antigen levels are 
provided for each patient: pre-treatment, represented by preazt, and post-treatment, 
represented by postazt. 


Figure 5.1 Response of serum antigen level to AZT 

[Data listing of the preazt and postazt values for the 20 patients] 
Figure 5.2 shows the results for the Wilcoxon signed-ranks test. 


Figure 5.2 Wilcoxon signed-ranks test results for AZT data 


Ranks 

                                                      N    Mean Rank   Sum of Ranks 
Serum Antigen Level Post AZT -   Negative Ranks (1)    2       6.00        12.00 
Serum Antigen Level (pg/ml)      Positive Ranks (2)   14       8.86       124.00 
Pre-AZT                          Ties (3)              4 
                                 Total                20 

1. Serum Antigen Level Post AZT < Serum Antigen Level (pg/ml) Pre-AZT 
2. Serum Antigen Level Post AZT > Serum Antigen Level (pg/ml) Pre-AZT 
3. Serum Antigen Level Post AZT = Serum Antigen Level (pg/ml) Pre-AZT 


Test Statistics (1) 

Serum Antigen Level Post AZT - Serum Antigen Level (pg/ml) Pre-AZT 

  Z                            -2.896 (2) 
  Asymp. Sig. (2-tailed)         .004 
  Exact Sig. (2-tailed)          .002 
  Exact Sig. (1-tailed)          .001 
  Point Probability              .000 

1. Wilcoxon Signed Ranks Test 
2. Based on negative ranks. 

The test statistic is the smaller of the two sums of ranks, which is 12. The exact one-sided 
p value is 0.001, about half the size of the asymptotic one-sided p value. To obtain the 
asymptotic one-sided p value, divide the asymptotic two-sided p value, 0.004, by 2 
(0.004/2 = 0.002). If this data set had been extremely large, you might have preferred 
to compute the Monte Carlo estimate of the exact p value. The Monte Carlo estimate 
shown in Figure 5.3 is based on sampling 10,000 times from the reference set $\Gamma$, defined 
by Equation 5.6. 
Figure 5.3 Monte Carlo results of Wilcoxon signed-ranks test for AZT data 


Test Statistics (1, 2) 

Serum Antigen Level Post AZT - Serum Antigen Level (pg/ml) Pre-AZT 

  Z                                                        -2.896 (3) 
  Asymp. Sig. (2-tailed)                                     .004 
  Monte Carlo Sig. (2-tailed)   Sig.                         .002 
                                99% Confidence Interval 
                                  Lower Bound                .001 
                                  Upper Bound                .004 
  Monte Carlo Sig. (1-tailed)   Sig.                         .001 
                                99% Confidence Interval 
                                  Lower Bound                .0002 
                                  Upper Bound                .0018 

1. Wilcoxon Signed Ranks Test 
2. Based on 10000 sampled tables with starting seed 2000000. 
3. Based on negative ranks. 


The Monte Carlo point estimate of the exact one-sided p value is 0.001, very close to the 
exact answer. Also, the Monte Carlo confidence interval guarantees with 99% confidence 
that the true p value is in the range (0.0002, 0.0018). This guarantee is unavailable with 
the asymptotic method; thus, the Monte Carlo estimate would be the preferred option for 
large samples. 
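
The figures reported for the Wilcoxon signed-ranks test can be checked with a short calculation. The sketch below assumes that the 16 non-zero differences have no ties in absolute value, so that the ranks are simply 1 through 16; the raw data are not used, and this assumption is ours rather than something stated in the output. It uses a simple counting recursion, not the network algorithm of Exact Tests.

```python
import math
from fractions import Fraction


def left_tail_counts(n, t_max):
    """Count subsets of the ranks 1..n whose sum is s, for s = 0..t_max."""
    ways = [1] + [0] * t_max
    for r in range(1, n + 1):
        # Go downward so that each rank is used at most once.
        for s in range(t_max, r - 1, -1):
            ways[s] += ways[s - r]
    return ways


n, t = 16, 12                           # 16 non-zero differences, rank sum 12
mean = n * (n + 1) / 4                  # Equation 5.7 with ranks 1..n
var = n * (n + 1) * (2 * n + 1) / 24    # Equation 5.8 with ranks 1..n
z = (t - mean) / math.sqrt(var)         # Equation 5.11

ways = left_tail_counts(n, t)
p_one_sided = Fraction(sum(ways), 2 ** n)   # Pr(T <= 12), Equation 5.13
p_two_sided = 2 * p_one_sided               # Equation 5.14
point_prob = Fraction(ways[t], 2 ** n)      # Pr(T = 12)
```

Under the no-ties assumption, 70 of the 65,536 sign assignments give a rank sum of 12 or less, and the standardized statistic rounds to -2.896, which agrees with the values shown in Figure 5.2.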


Next, the exact sign test is run on these data. The results are displayed in Figure 5.4. 


Figure 5.4 Sign test results for AZT data 


Frequencies 

                                                            N 
Serum Antigen Level Post AZT -   Negative Differences (1)    2 
Serum Antigen Level (pg/ml)      Positive Differences (2)   14 
Pre-AZT                          Ties (3)                    4 
                                 Total                      20 

1. Serum Antigen Level Post AZT < Serum Antigen Level (pg/ml) Pre-AZT 
2. Serum Antigen Level Post AZT > Serum Antigen Level (pg/ml) Pre-AZT 
3. Serum Antigen Level Post AZT = Serum Antigen Level (pg/ml) Pre-AZT 


Test Statistics (1) 

Serum Antigen Level Post AZT - Serum Antigen Level (pg/ml) Pre-AZT 

  Exact Sig. (2-tailed)          .004 (2, 3) 
  Exact Sig. (1-tailed)          .002 (2, 3) 
  Point Probability              .002 

1. Sign Test 
2. Exact results are provided instead of Monte Carlo for this test. 
3. Binomial distribution used. 
The exact one-sided p value is 0.002. Notice that the exact one-sided p value for the sign 
test, while still extremely significant, is nevertheless larger than the corresponding exact 
one-sided p value for the Wilcoxon signed-ranks test. Since the sign test only takes into 
account the signs of the differences and not their ranks, it has less power than the 
Wilcoxon signed-ranks test. This accounts for its higher exact p value. The corresponding 
asymptotic inference fails to capture this distinction. 
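
The sign test p values follow from the binomial distribution with success probability 1/2, applied to the 16 untied pairs (14 positive and 2 negative differences). The function below is a hypothetical helper written for this illustration.

```python
from math import comb


def sign_test_exact(n_pos, n_neg):
    """Exact sign test p values from the binomial(n, 1/2) distribution."""
    n = n_pos + n_neg                   # ties are already excluded
    k = min(n_pos, n_neg)
    one_sided = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    point = comb(n, k) / 2 ** n
    return one_sided, min(1.0, 2 * one_sided), point


one_sided, two_sided, point = sign_test_exact(14, 2)
```

The one-sided value is 137/65,536, or about 0.0021, roughly twice the one-sided value obtained for the Wilcoxon signed-ranks test on the same data.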
McNemar Test 


The McNemar test (Siegel and Castellan, 1988; Agresti, 1990) is used to test the 
equality of binary response rates from two populations in which the data consist of 
paired, dependent responses, one from each population. It is typically used in a repeated 
measurements situation in which each subject’s response is elicited twice, once before 
and once after a specified event (treatment) occurs. The test then determines if the initial 
response rate (before the event) equals the final response rate (after the event). Suppose 
two binomial responses are observed on each of N individuals. Let $y_{11}$ be the count of 
the number of individuals whose first and second responses are both positive. Let $y_{22}$ 
be the count of the number of individuals whose first and second responses are both 
negative. Let $y_{12}$ be the count of the number of individuals whose first response is 
positive and whose second response is negative. Finally, let $y_{21}$ be the count of the 
number of individuals whose first response is negative and whose second response is 
positive. Then the McNemar test is defined on a single 2 x 2 table of the form 

$$y = \begin{pmatrix} y_{11} & y_{12} \\ y_{21} & y_{22} \end{pmatrix}$$

Let $(\pi_{11}, \pi_{12}, \pi_{21}, \pi_{22})$ denote the four cell probabilities for this table. The null 
hypothesis of interest is 

$$H_0: \pi_{12} = \pi_{21}$$


The McNemar test depends only on the values of the off-diagonal elements of the 2 x 2 
table. Its test statistic is 

$$MC(y) = y_{12} - y_{21}$$ Equation 5.20 

Now let $y$ represent any generic 2 x 2 contingency table, and suppose that $x$ is the 2 x 2 
table actually observed. The exact permutation distribution of the test statistic (see 
Equation 5.20) is obtained by conditioning on the observed sum of off-diagonal terms, 
or discordant pairs, 

$$N_d = y_{12} + y_{21}$$

The reference set is defined by 

$$\Gamma = \{ y : y \text{ is } 2 \times 2;\ y_{12} + y_{21} = N_d \}$$ Equation 5.21 
Under the null hypothesis, the conditional probability, $P(y)$, of observing any $y \in \Gamma$ 
is binomial with parameters $(0.5, N_d)$. Thus, 

$$P(y) = \frac{(0.5)^{N_d} N_d!}{y_{12}!\, y_{21}!}$$ Equation 5.22 

and the probability that the McNemar statistic equals or exceeds its observed value 
$MC(x)$ is readily evaluated as 

$$\Pr(MC(y) \ge MC(x)) = \sum_{MC(y) \ge MC(x)} P(y)$$ Equation 5.23 

the sum being taken over all $y \in \Gamma$. The probability that the McNemar statistic is less than 
or equal to $MC(x)$ is similarly obtained. The exact one-sided p value is then defined as 

$$p_1 = \min\{ \Pr(MC(y) \le MC(x)),\ \Pr(MC(y) \ge MC(x)) \}$$ Equation 5.24 

You can show that the exact distribution of the test statistic $MC(y)$ is symmetric about 
0. Therefore, the exact two-sided p value is defined as double the exact one-sided p value: 

$$p_2 = 2p_1$$ Equation 5.25 


In large samples, the two-sided asymptotic p value is calculated by a $\chi^2$ approximation 
with a continuity correction, and 1 degree of freedom, as shown in Equation 5.26. 

$$\chi^2 = \frac{(|y_{12} - y_{21}| - 1)^2}{N_d}$$ Equation 5.26 


The definition of the one-sided p value for the exact case as the minimum of the left and 
right tails must be interpreted with caution. It should not be concluded automatically, 
based on a small one-sided p value, that the data have yielded a statistically significant 
outcome in the direction originally hypothesized. It is possible that the population 
difference occurs in the opposite direction from what was hypothesized before gathering 
the data. The direction of the difference can be determined from the sign of the test 
statistic, calculated as shown in Equation 5.27. 


$$MC(y) = y_{12} - y_{21}$$ Equation 5.27 


You should examine the one-sided p value as well as the sign of the test statistic before 
drawing conclusions from the data. 
Example: Voters’ Preference 


The following data are taken from Siegel and Castellan (1988). The crosstabulation 
shown in Figure 5.5 shows changes in preference for presidential candidates before and 
after a television debate. 


Figure 5.5 Crosstabulation of preference for presidential candidates before and after TV 
debate 


Preference Before TV Debate * Preference After TV Debate Crosstabulation 


Count 
                                 Preference After TV Debate 
                                 Carter        Reagan 
Preference Before    Carter        28            13 
TV Debate            Reagan         7            27 


The results of the McNemar test for these data are shown in Figure 5.6. 


Figure 5.6 McNemar test results 


Test Statistics (1) 

Preference Before TV Debate & Preference After TV Debate 

  N                              75 
  Exact Sig. (2-tailed)          .263 (2) 
  Exact Sig. (1-tailed)          .132 
  Point Probability              .074 

1. McNemar Test 
2. Binomial distribution used. 


The exact one-sided p value is 0.132. Notice that the value of the McNemar statistic, 
13 - 7, has a positive sign. This indicates that of the 20 (13 + 7) discordant pairs, more 
switched preferences from Carter to Reagan (13) than from Reagan to Carter (7). The 
point probability, 0.074, is the probability that MC(y) = MC(x) = 13 - 7 = 6. 
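
The exact values in Figure 5.6 can be reproduced from Equations 5.22 to 5.25, and the continuity-corrected statistic of Equation 5.26 can be evaluated for comparison. The function names below are hypothetical, and the sketch relies on the fact that, with the number of discordant pairs fixed, MC(y) increases with the first off-diagonal count.

```python
import math
from math import comb


def mcnemar_exact(y12, y21):
    """Exact McNemar p values, conditioning on N_d = y12 + y21 discordant pairs."""
    n_d = y12 + y21

    def prob(k):
        # Equation 5.22 for a table whose first off-diagonal count is k.
        return comb(n_d, k) / 2 ** n_d

    right = sum(prob(k) for k in range(y12, n_d + 1))   # Pr(MC(y) >= MC(x))
    left = sum(prob(k) for k in range(0, y12 + 1))      # Pr(MC(y) <= MC(x))
    p1 = min(left, right)                               # Equation 5.24
    return p1, min(1.0, 2 * p1), prob(y12)              # p2 and point probability


def mcnemar_asymptotic(y12, y21):
    """Continuity-corrected chi-square statistic (Equation 5.26) and its p value."""
    stat = (abs(y12 - y21) - 1) ** 2 / (y12 + y21)
    # For 1 degree of freedom, Pr(chi-square >= x) = erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))


p1, p2, point = mcnemar_exact(13, 7)
stat, p_asym = mcnemar_asymptotic(13, 7)
```

For these data the continuity-corrected statistic is 25/20 = 1.25, and its two-sided asymptotic p value is close to the exact two-sided value of 0.263.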
Marginal Homogeneity Test 


The marginal homogeneity test (Agresti, 1990) is an extension of the McNemar test 
from two categories to more than two categories. The data are thus defined on a square 
$c \times c$ contingency table in which the row categories represent the first member of a pair 
of correlated observations, and the column categories represent the second member of 
the pair. In Exact Tests, the categories are required to be ordered. The data are thus 
represented by a $c \times c$ contingency table with entry $x_{ij}$ in row i and column j. This entry 
is the count of the number of pairs of observations in which the first member of the pair 
falls into ordered category i and the second member into ordered category j. Let $\pi_j$ be 
the probability that the first member of the matched pair falls in row j. Let $\pi'_j$ be the 
probability that the second member of the matched pair falls in column j. The null 
hypothesis of marginal homogeneity states that 

$$H_0: \pi_j = \pi'_j \quad \text{for all } j = 1, 2, \ldots, c$$

In other words, the probability of being classified into category j is the same for the first 
as well as the second member of the matched pair. 

The marginal homogeneity test for ordered categories can be formulated as a 
stratified 2 x c contingency table. The theory underlying this test, the definition of its 
test statistic, and the computation of one- and two-sided p values are discussed in Kuritz, 
Landis, and Koch (1988). 


Example: Matched Case-Control Study of Endometrial Cancer 


Figure 5.7, taken from the Los Angeles Endometrial Study (Breslow and Day, 1980), 
displays a crosstabulation of average doses of conjugated estrogen between cases and 
matched controls. 


Figure 5.7 Crosstabulation of dose for cases with dose for controls 


Dose (Cases) * Dose (Controls) Crosstabulation 


Count 


Dose (Controls) 
.0000 .2000 .5125 .7000 
Dose .0000 2 3 1 
(Cases) .2000 4 2 1 
.5125 2 3 1 
.7000 12 1 2 1 
In this matched pairs setting, the test of whether the cases and controls have the same 
exposure to estrogen is equivalent to testing the null hypothesis that the row margins 
and column margins come from the same distribution. The results of running the exact 
marginal homogeneity test on these data are shown in Figure 5.8. 


Figure 5.8 Marginal homogeneity results for cancer data 


Marginal Homogeneity Test 

Dose (Cases) & Dose (Controls) 

  Distinct Values 
  Off-Diagonal Cases 
  Observed MH Statistic              6.687 
  Mean MH Statistic                 12.869 
  Std. Deviation of MH Statistic     1.655 
  Std. MH Statistic                 -3.735 
  Asymp. Sig. (2-tailed)              .000 
  Exact Sig. (2-tailed)               .000 
  Exact Sig. (1-tailed)               .000 
  Point Probability                   .000 


The p values are extremely small, showing that the cases and controls have significantly 
different exposures to estrogen. The null hypothesis of marginal homogeneity is rejected. 


Example: Pap-Smear Classification by Two Pathologists 


This example is taken from Agresti (1990). Two pathologists classified the Pap-smear 
slides of 118 women in terms of severity of lesion in the uterine cervix. The 
classifications fell into five ordered categories. Level 1 is negative, Level 2 is atypical squamous 
hyperplasia, Level 3 is carcinoma in situ, Level 4 is squamous carcinoma, and Level 5 is 
invasive carcinoma. Figure 5.9 shows a crosstabulation of level classifications between 
two pathologists. 


Figure 5.9 Crosstabulation of Pap-smear classifications by two pathologists 


First Pathologist * Pathologist 2 Crosstabulation 


Count 
                              Pathologist 2 
                        Level 1   Level 2   Level 3   Level 4   Level 5 
First         Level 1     22         2         2 
Pathologist   Level 2      5         7        14 
              Level 3                2        36 
              Level 4                1        14         7 
              Level 5                          3                   3 
The question of interest is whether there is agreement between the two pathologists. One 
way to answer this question is through the measures of association discussed in Part 4. 
Another way is to run the test of marginal homogeneity. The results of the exact 
marginal homogeneity test are shown in Figure 5.10. 


Figure 5.10 Results of marginal homogeneity test 


Marginal Homogeneity Test 

First Pathologist & Pathologist 2 

  Distinct Values 
  Off-Diagonal Cases                43 
  Observed MH Statistic            114.000 
  Mean MH Statistic                118.500 
  Std. Deviation of MH Statistic     3.905 
  Std. MH Statistic                 -1.152 
  Asymp. Sig. (2-tailed)              .249 
  Exact Sig. (2-tailed)               .307 
  Exact Sig. (1-tailed)               .154 
  Point Probability                   .053 


The exact two-sided p value is 0.307, indicating that the classifications by the two 
pathologists are not significantly different. Notice, however, that there is a fairly large 
difference between the exact and asymptotic p values because of the sparseness in the 
off-diagonal elements. 
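
The summary quantities in Figure 5.10 can be reproduced with a short calculation, which may help in reading the output. The sketch rests on several assumptions that are ours and are not stated in this chapter: the cell positions assigned to the counts of Figure 5.9 (blank cells taken as zero), category scores 1 through 5, and a statistic equal to the sum of the second member's scores over the off-diagonal pairs, with each discordant pair equally likely to occur in either order under the null hypothesis. The formal definition is the one given by Kuritz, Landis, and Koch (1988).

```python
import math

# Cell counts assumed for Figure 5.9 (rows: first pathologist, columns: second).
table = [
    [22, 2, 2, 0, 0],
    [5, 7, 14, 0, 0],
    [0, 2, 36, 0, 0],
    [0, 1, 14, 7, 0],
    [0, 0, 3, 0, 3],
]
scores = [1, 2, 3, 4, 5]                # ordered category scores (assumed)

row_sum = col_sum = sq_diff = 0.0
off_diagonal = 0
for i, row in enumerate(table):
    for j, count in enumerate(row):
        if i != j and count:
            off_diagonal += count
            row_sum += scores[i] * count
            col_sum += scores[j] * count
            sq_diff += (scores[i] - scores[j]) ** 2 * count

observed = col_sum                      # assumed form of the MH statistic
mean = (row_sum + col_sum) / 2          # each pair contributes either score
std_dev = math.sqrt(sq_diff / 4)
z = (observed - mean) / std_dev
p_asym = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal p value
```

Under these assumptions the calculation gives 43 off-diagonal cases, an observed statistic of 114, a mean of 118.5, a standard deviation of about 3.905, and a standardized statistic of about -1.152, in agreement with Figure 5.10.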


Two-Sample Inference: 
Independent Samples 


This chapter discusses tests based on two independent samples of data drawn from two 
distinct populations. The objective is to test the null hypothesis that the two populations 
have the same response distributions against the alternative that the response distribu- 
tions are different. The data could also arise in randomized clinical trials in which each 
subject is assigned randomly to one of two treatments. The goal is to test whether the 
treatments differ with respect to their response distributions. Here it is not necessary to 
make any assumptions about the underlying populations from which these subjects 
were drawn. Lehmann (1975) has demonstrated clearly that the same statistical meth- 
ods are applicable whether the data arose from a population model or a randomization 
model. Thus, no distinction will be made between the two ways of gathering the data. 

There are important differences between the structure of the data for this chapter and 
the previous one. The data in this chapter are independent both within a sample and 
across the two samples, whereas the data in the previous chapter consisted of N 
matched (correlated) pairs of observations with independence across pairs. Moreover, 
in the previous chapter, the sample size was required to be the same for each sample, 
whereas in this chapter, the sample size may differ, with $n_j$ being the size of sample 
$j = 1, 2$. 


Available Tests 


Table 6.1 shows the available tests for two independent samples, the procedure from 
which they can be obtained, and a bibliographical reference for each test. 


Table 6.1 Available tests 


Test                        Procedure                                        Reference 

Mann-Whitney test           Nonparametric Tests: Two Independent Samples     Sprent (1993) 
Kolmogorov-Smirnov test     Nonparametric Tests: Two Independent Samples     Conover (1980) 
Wald-Wolfowitz runs test    Nonparametric Tests: Two Independent Samples     Gibbons (1985) 


When to Use Each Test 


The tests in this chapter deal with the comparison of samples drawn from the two distri- 
butions. The null hypothesis is that the two distributions are the same. 

The choice of test depends on the type of alternative hypothesis you are interested in 
detecting. 


Mann-Whitney test. The Mann-Whitney test, or Wilcoxon rank-sum test, is one of the 
most popular two-sample tests. It is generally used to detect “shift alternatives.” That is, 
the two distributions have the same general shape, but one of them is shifted relative to 
the other by a constant amount under the alternative hypothesis. This test has an 
asymptotic relative efficiency of 95.5% relative to the Student’s t test when the underlying 
populations are normal. 


Kolmogorov-Smirnov test. The Kolmogorov-Smirnov test is a distribution-free test for 
the equality of two distributions against the general alternative that they are different. 
Because this test attempts to detect any possible deviation from the null hypothesis, it 
will not be as powerful as the Mann-Whitney test if the alternative is that one distribu- 
tion is shifted with respect to the other. One-sided forms of the Kolmogorov-Smirnov 
test can be specified and are powerful against the one-sided alternative that one distri- 
bution is stochastically larger than the other. 


Wald-Wolfowitz runs test. The Wald-Wolfowitz runs test is a competitor to the 
Kolmogorov-Smirnov test for testing the equality of two distributions against general 
alternatives. It will not be powerful against specific alternatives such as the shift alternative, 
but it is a good test when no particular alternative hypothesis can be specified. This test is 
even more general than the Kolmogorov-Smirnov test in the sense that it has no one-sided 
version. 


Statistical Methods 


The data for all of the tests in this chapter consist of two independent samples, each of 
size $n_j$, $j = 1, 2$, where $n_1 + n_2 = N$. These N observations can be represented in the 
form of the one-way layout shown in Table 6.2. 


Table 6.2 One-way layout for two independent samples 


Samples 
1              2 
$u_{11}$       $u_{12}$ 
$u_{21}$       $u_{22}$ 
$\vdots$       $\vdots$ 
$u_{n_1 1}$    $u_{n_2 2}$ 


This table, denoted by $u$, displays the observed one-way layout of raw data. The 
observations in $u$ arise from continuous univariate distributions (possibly with ties). Let the 
formula 

$$F_j(v) = \Pr(V \le v \mid j), \quad j = 1, 2$$ Equation 6.1 

denote the distribution from which the $n_j$ observations displayed in column j of the 
one-way layout were drawn. The goal is to test the null hypothesis 

$$H_0: F_1 = F_2$$ Equation 6.2 

The observations in $u$ are independent both within and across columns. In order to test 
$H_0$ by nonparametric methods, it is necessary to replace the original observations in the 
one-way layout with corresponding scores. These scores represent various ways of 
ranking the data in the pooled sample of size N. Different tests utilize different scores. Let 
$w_{ij}$ be the score corresponding to $u_{ij}$. Then the one-way layout, in which the original 
data have been replaced by scores, is represented by Table 6.3. 


Table 6.3 One-way layout with scores replacing original data 


Samples 
1              2 
$w_{11}$       $w_{12}$ 
$w_{21}$       $w_{22}$ 
$\vdots$       $\vdots$ 
$w_{n_1 1}$    $w_{n_2 2}$ 


This table, denoted by $w$, displays the observed one-way layout of scores. Inference 
about $H_0$ is based on comparing this observed one-way layout to others like it, in which 
the individual $w_{ij}$ elements are the same but they occupy different rows and columns. 
In order to develop this idea more precisely, let the set $W$ denote the collection of all 
possible two-column one-way layouts, with $n_1$ elements in column 1 and $n_2$ elements in 
column 2, whose members include $w$ and all its permutations. The random variable $\tilde{w}$ 
is a permutation of $w$ if it contains precisely the same scores as $w$, but these scores have 
been rearranged so that, for at least one $(i, j)$, $(i', j')$ pair, the scores $w_{ij}$ and $w_{i'j'}$ are 
interchanged. 


Formally, let 

$$W = \{ \tilde{w} : \tilde{w} = w, \text{ or } \tilde{w} \text{ is a permutation of } w \}$$ Equation 6.3 

where $\tilde{w}$ is a random variable, and $w$ is a specific value assumed by it. 

To clarify these concepts, let us consider a simple numerical example. Let the 
original data come from two independent samples of size 5 and 3, respectively. These 
data are displayed as the one-way layout shown in Table 6.4. 


Table 6.4 One-way layout of original data 


Samples 
1      2 
27     38 
30      9 
55     27 
72 
18 


As you will see in “Mann-Whitney Test” on p. 83, in order to perform the Mann- 
Whitney test on these data, the original data must be replaced by their ranks. The one- 
way layout of observed scores, based on replacing the original data with their ranks, is 
displayed in Table 6.5. 


Table 6.5 One-way layout with ranks replacing original data 


Samples 
1      2 
3.5    6 
5      1 
7      3.5 
8 
2 


This one-way layout of ranks is denoted by w. It is the one actually observed. Notice that 
two observations were tied at 27 in u. Had they been separated by a small amount, they 
would have ranked 3 and 4. But since they are tied, the mid-rank (3 + 4)/2 = 3.5 is 
used as the rank for each of them in w. The symbol W represents the set of all possible 
one-way layouts whose entries are the eight numbers in w, with five numbers in column 
1 and three numbers in column 2. Thus, w is one member of W. (It is the one actually 
observed.) Another member is w', representing a different permutation of the numbers 
in w, as shown in Table 6.6. 
Table 6.6 Permutation of the observed one-way layout of scores 


Samples 
1      2 
6      5 
1      8 
3.5    7 
3.5 
2 


All of the test statistics in this chapter are univariate functions of $\tilde{w} \in W$. Let the test 
statistic be denoted by $T(\tilde{w}) \equiv T$ and its observed value be denoted by $t(w) \equiv t$. The 
functional form of $T(\tilde{w})$ will be defined separately for each test, in subsequent sections of 
this chapter. Following is a discussion of how the null distribution of $T$ may be derived 
in general, and how it is used for p value computations. 


The Null Distribution of T 


In order to test the null hypothesis, $H_0$, you need to derive the distribution of $T$ under 
the assumption that $H_0$ is true. This distribution is obtained by the following 
permutational argument: 


If $H_0$ is true, every member $\tilde{w} \in W$ has the same probability of being observed. 


Lehmann (1975) has shown that the above permutational argument is valid whether the 
data were gathered independently from two populations or by assigning N subjects to 
two treatments in accordance with a predetermined randomization rule. No distinction 
is made between these two ways of gathering the data, although one usually applies to 
observational studies and the other to randomized clinical trials. 

It follows from the above permutational argument that the exact probability of 
observing any $\tilde{w} \in W$ is 

$$h(\tilde{w}) = \frac{\prod_{j=1}^{2} n_j!}{N!}$$ Equation 6.4 

which does not depend on the specific way in which the original one-way layout, $w$, was 
permuted. Then 

$$\Pr(T = t) = \sum_{T(\tilde{w}) = t} h(\tilde{w})$$ Equation 6.5 

the sum being taken over all $\tilde{w} \in W$. Similarly, the right tail of the distribution of $T$ is 
obtained as 

$$\Pr(T \ge t) = \sum_{T(\tilde{w}) \ge t} h(\tilde{w})$$ Equation 6.6 
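
For the small example of Table 6.5, Equation 6.6 can be evaluated by direct enumeration. The sketch below takes the statistic to be the sum of the scores in column 1, as one illustrative choice, and it is not the fast numerical algorithm used by Exact Tests. Layouts that differ only in the order of scores within a column give the same statistic, so it is enough to choose which five of the eight scores fall in column 1; each such choice has probability 5! 3! / 8! = 1/56.

```python
from fractions import Fraction
from itertools import combinations

ranks = [3.5, 5, 7, 8, 2, 6, 1, 3.5]    # Table 6.5: column 1 first, then column 2
n1 = 5
t_obs = sum(ranks[:n1])                 # observed sum of scores in column 1

# Choose the positions of the n1 scores that are assigned to column 1.
layouts = list(combinations(range(len(ranks)), n1))
h = Fraction(1, len(layouts))           # Equation 6.4
right_tail = sum(h for idx in layouts
                 if sum(ranks[i] for i in idx) >= t_obs)    # Equation 6.6
```

Of the 56 equally likely assignments, 13 give a column 1 sum at least as large as the observed value of 25.5.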


The probability distribution of 7 and its tail areas are obtained in Exact Tests by fast nu- 
merical algorithms. In large samples, you can obtain an asymptotic approximation for 
Equation 6.6. Different approximations apply to the different tests in this chapter and are 
discussed in the section dealing with the specific tests. 


P Value Calculations 


The p value is the probability, under $H_0$, of obtaining a value of the test statistic at least 
as extreme as the one actually observed. This probability is computed as the tail area of 
the null distribution of the test statistic. The choice of tail area, left-tail, right-tail, or two- 
tails, depends on whether you are interested in a one- or two-sided p value, and also on 
the type of alternative hypothesis you want to detect. The three statistical tests discussed 
in this chapter are all different in this respect. For the Mann-Whitney test, both one- and 
two-sided p values are defined, and they are computed as left, right, or two-tailed 
probabilities, depending on the alternative hypothesis. For the Kolmogorov-Smirnov 
test, the p values are computed from the right tail as two-sided p values, depending on 
how the test statistic is defined. Finally, for the Wald-Wolfowitz runs test, only two- 
sided p values exist, and they are always computed from the left tail of the null 
distribution of the test statistic. Because of these complexities, it is more useful to define 
the p value for each test when the specific test is discussed. 


Mann-Whitney Test 


The Mann-Whitney test is one of the most popular nonparametric two-sample tests. 
Indeed, the original paper by Frank Wilcoxon (1945), in which this test was first 
presented, is one of the most widely referenced statistical papers of all time. For a detailed 
discussion of this test, see Lehmann (1975). It is assumed that sample 1 consists of $n_1$ 
observations drawn from the distribution $F_1$ and that sample 2 consists of $n_2$ 
observations drawn from the distribution $F_2$. The null hypothesis is given by Equation 6.2. 
The Wilcoxon test is especially suited to detecting departures from the null hypothesis, 
in which $F_2$ is shifted relative to $F_1$ according to the alternative hypothesis 

$$H_1: F_2(v) = F_1(v - \theta)$$ Equation 6.7 

The shift parameter $\theta$ is unknown. If it can be specified a priori that $\theta$ must be either 
positive or negative, the test is said to be one-sided, and a one-sided p value can be used 
to decide whether to reject $H_0$. On the other hand, when it is not possible to specify a 
priori what the sign of $\theta$ ought to be, the test is said to be two-sided. In that case, the 
two-sided p value is used to decide if $H_0$ can be rejected. 

Before specifying how the one- and two-sided p values are computed, the test statistic 
$T(\tilde{w}) \equiv T$ must be defined. The first step is to replace the raw data, $u$, by corresponding 
scores, $w$. For the Mann-Whitney test, the score, $w_{ij}$, replacing the original observation, 
$u_{ij}$, is simply the rank of that $u_{ij}$ in the pooled sample of $N = n_1 + n_2$ observations. 
If there are no ties among the $u_{ij}$'s, the N ranks thus substituted into the one-way layout 
will simply be some permutation of the first N integers. If there are ties in the data, 
however, use mid-ranks instead of ranks. 

In order to define the mid-ranks formally, let $a_{[1]} \le a_{[2]} \le \cdots \le a_{[N]}$ denote the
pooled sample of all of the N observations in u, represented as a single row of data sorted
in ascending order. To allow for the possibility of ties, let there be g distinct values
among the sorted $a_{[i]}$'s, with $e_1$ observations being equal to the smallest value,
$e_2$ observations being equal to the second smallest value, $e_3$ observations being equal
to the third smallest value, and so forth, until finally $e_g$ observations are equal to the
largest value. It is now possible to define the mid-ranks precisely. For $l = 1, 2, \ldots, g$,
the distinct mid-rank assumed by all the $e_l$ observations tied in the lth smallest position
is $w^*_l = e_1 + e_2 + \cdots + e_{l-1} + (e_l + 1)/2$.

Finally, you can determine the $a_{[i]}$, and hence the corresponding $u_{ij}$, with which
each $w^*_l$ is associated. You can then substitute the appropriate $w^*_l$ in place of the $u_{ij}$ in
the one-way layout w. In this manner you replace u, the original one-way layout of raw
data, with w, the corresponding one-way layout of mid-ranks, whose individual
elements, $w_{ij}$, are the appropriate members of the set of the g distinct mid-ranks
$(w^*_1, w^*_2, \ldots, w^*_g)$. The set W of all possible permutations of w is defined by Equation 6.3.
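
The mid-rank rule can be illustrated with a short Python sketch. The function name `midranks_from_tie_counts` is an illustrative choice, not part of Exact Tests; it takes the tie counts $e_1, \ldots, e_g$ and returns the g distinct mid-ranks. The tie counts in the usage line are those of the pooled blood pressure data analyzed later in this chapter.

```python
def midranks_from_tie_counts(e):
    """Return the distinct mid-ranks [w*_1, ..., w*_g] for tie counts e = [e_1, ..., e_g]."""
    mids = []
    below = 0  # e_1 + ... + e_{l-1}: number of observations strictly smaller
    for e_l in e:
        mids.append(below + (e_l + 1) / 2)
        below += e_l
    return mids


# Pooled sample of N = 15 observations with 9 distinct values
print(midranks_from_tie_counts([1, 1, 1, 1, 4, 3, 1, 2, 1]))
# [1.0, 2.0, 3.0, 4.0, 6.5, 10.0, 12.0, 13.5, 15.0]
```

Each observation then receives the mid-rank of the tie group to which it belongs.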

The Wilcoxon rank-sum test statistic for the first column (or sample), $T(w) \equiv T$, is
defined as the sum of mid-ranks of the first column (or sample) in the one-way layout
w. That is, for any $w \in W$,


$T = \sum_{i=1}^{n_1} w_{i1}$    Equation 6.8


Its mean is


$E(T) = n_1(n_1 + n_2 + 1)/2$    Equation 6.9

its variance is


$\mathrm{var}(T) = \frac{n_1 n_2}{12}\left[n_1 + n_2 + 1 - \frac{\sum_{l=1}^{g} e_l(e_l^2 - 1)}{(n_1 + n_2)(n_1 + n_2 - 1)}\right]$    Equation 6.10


and its observed value is


$t = \sum_{i=1}^{n_1} w_{i1}$    Equation 6.11


The Wilcoxon rank-sum test statistic for the second column (or sample) is defined
similarly.

In its Mann-Whitney form, this observed statistic is defined by subtracting off a
constant:

$u = t - n_1(n_1 + 1)/2$    Equation 6.12


The Wilcoxon rank-sum statistic corresponding to the column with the smaller Mann- 
Whitney statistic is displayed and used as the test statistic. 


Exact P Values 


The Wilcoxon rank-sum test statistic, T, is considered extreme if it is either very large
or very small. Large values of T indicate a departure from the null hypothesis in the
direction $\theta > 0$, while small values of T indicate a departure from the null hypothesis in
the opposite direction, $\theta < 0$. Whenever the test statistic possesses a directional
property of this type, it is possible to define both one- and two-sided p values. The exact
one-sided p value is defined as


$p_1 = \min\{\Pr(T \le t), \Pr(T \ge t)\}$    Equation 6.13


and the exact two-sided p value is defined as 


$p_2 = \Pr(|T - E(T)| \ge |t - E(T)|)$    Equation 6.14


Monte Carlo P Values


When exact p values are too difficult to compute, you can estimate them by Monte Carlo 
sampling. The following steps show how you can use Monte Carlo to estimate the exact p 
value given by Equation 6.14. The same procedure can be readily adapted to Equation 6.13. 


1. Generate a new one-way layout of scores by permuting the original layout, w, in one
of the $N!/(n_1!\,n_2!)$ equally likely ways.


2. Compute the value of the test statistic T for the permuted one-way layout. 


3. Define the random variable


$z = 1$ if $|T - E(T)| \ge |t - E(T)|$, and $z = 0$ otherwise    Equation 6.15


Repeat the above steps a total of M times to generate the realizations $(z_1, z_2, \ldots, z_M)$ for
the random variable z. Then an unbiased estimate of $p_2$ is


$\hat{p}_2 = \frac{1}{M}\sum_{l=1}^{M} z_l$    Equation 6.16

Next, let

$\hat{\sigma} = \left[\frac{1}{M-1}\sum_{l=1}^{M}(z_l - \hat{p}_2)^2\right]^{1/2}$    Equation 6.17


be the sample standard deviation of the $z_l$'s. Then a 99% confidence interval for the exact
p value is


$CI = \hat{p}_2 \pm 2.576\,\hat{\sigma}/\sqrt{M}$    Equation 6.18


A technical difficulty arises when either $\hat{p}_2 = 0$ or $\hat{p}_2 = 1$. Now the sample standard
deviation is 0, but the data do not support a confidence interval of zero width. An
alternative way to compute a confidence interval that does not depend on $\hat{\sigma}$ is based on
inverting an exact binomial hypothesis test when an extreme outcome is encountered. It
can be easily shown that if $\hat{p}_2 = 0$, an $\alpha$% confidence interval for the exact p value is


$CI = [0,\ 1 - (1 - \alpha/100)^{1/M}]$    Equation 6.19

Similarly, when $\hat{p}_2 = 1$, an $\alpha$% confidence interval for the exact p value is


$CI = [(1 - \alpha/100)^{1/M},\ 1]$    Equation 6.20


Exact Tests uses default values of M = 10000 and $\alpha$ = 99%. While these defaults can
be easily changed, they provide quick and accurate estimates of exact p values for a wide
range of data sets.
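
The Monte Carlo procedure above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the algorithm implemented in Exact Tests: the function name, its parameters, and the use of Python's `random` module are choices made here, and the scores are assumed to be mid-ranks, so that E(T) follows Equation 6.9.

```python
import random
from statistics import NormalDist


def monte_carlo_two_sided(w, n1, M=10000, alpha=99.0, seed=0):
    """Estimate the two-sided p value; the first n1 entries of w are sample 1.

    Returns (p_hat, (lower, upper)), the estimate and its alpha% confidence interval.
    """
    rng = random.Random(seed)
    N = len(w)
    mean = n1 * (N + 1) / 2                      # E(T) for mid-rank scores
    t = sum(w[:n1])                              # observed rank sum
    scores = list(w)
    hits = 0
    for _ in range(M):
        rng.shuffle(scores)                      # step 1: random permutation of the layout
        T = sum(scores[:n1])                     # step 2: statistic for the permuted layout
        if abs(T - mean) >= abs(t - mean):       # step 3: z = 1 or 0
            hits += 1
    p_hat = hits / M
    tail = (1 - alpha / 100) ** (1 / M)
    if hits == 0:                                # estimate is 0: binomial-based interval
        return p_hat, (0.0, 1 - tail)
    if hits == M:                                # estimate is 1: binomial-based interval
        return p_hat, (tail, 1.0)
    sigma = (p_hat * (1 - p_hat) * M / (M - 1)) ** 0.5   # sample s.d. of the 0/1 values
    quantile = NormalDist().inv_cdf(0.5 + alpha / 200)   # about 2.576 when alpha = 99
    half = quantile * sigma / M ** 0.5
    return p_hat, (p_hat - half, p_hat + half)


# Mid-ranks of the blood pressure data analyzed later in this chapter (treated group first)
w = [10, 13.5, 15, 6.5, 2, 10, 3, 6.5, 6.5, 6.5, 13.5, 10, 1, 12, 4]
print(monte_carlo_two_sided(w, n1=4))
```

Because the estimate depends on the random number stream, results from this sketch will not reproduce the Monte Carlo figures shown in this manual digit for digit.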


Asymptotic P Values 


The one- and two-sided p values are obtained by computing the normal approximations
to Equation 6.13 and Equation 6.14, respectively. Thus, the asymptotic one-sided p value
is defined as


$\tilde{p}_1 = \min\{\Phi((t - E(T))/\sigma_T),\ 1 - \Phi((t - E(T))/\sigma_T)\}$    Equation 6.21


and the asymptotic two-sided p value is defined as


$\tilde{p}_2 = 2\tilde{p}_1$    Equation 6.22


where $\Phi(z)$ is the tail area to the left of z from a standard normal distribution, and $\sigma_T$
is the standard deviation of T, obtained by taking the square root of Equation 6.10.
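
As a rough illustration, the normal approximation can be computed with the Python standard library. The function names below are illustrative, and the data are the diastolic blood pressure values from the example that follows, with the control group treated as the first sample.

```python
import math
from collections import Counter


def normal_cdf(z):
    """Left tail area of the standard normal distribution."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


def asymptotic_rank_sum(sample1, sample2):
    """Return (t, z, one-sided p, two-sided p) for the rank sum of sample1."""
    n1, n2 = len(sample1), len(sample2)
    N = n1 + n2
    ties = Counter(list(sample1) + list(sample2))    # e_l for each distinct value
    rank_of, below = {}, 0
    for value in sorted(ties):
        rank_of[value] = below + (ties[value] + 1) / 2   # mid-rank of this value
        below += ties[value]
    t = sum(rank_of[v] for v in sample1)             # observed rank sum
    mean = n1 * (N + 1) / 2                          # Equation 6.9
    tie_term = sum(e * (e * e - 1) for e in ties.values()) / (N * (N - 1))
    var = n1 * n2 / 12 * (N + 1 - tie_term)          # tie-corrected variance
    z = (t - mean) / math.sqrt(var)
    p1 = min(normal_cdf(z), 1 - normal_cdf(z))
    return t, z, p1, 2 * p1


control = [80, 94, 85, 90, 90, 90, 108, 94, 78, 105, 88]
treated = [94, 108, 110, 90]
t, z, p1, p2 = asymptotic_rank_sum(control, treated)
print(t, round(z, 3), round(p1, 4), round(p2, 3))
```

For these data the sketch gives a rank sum of 75, a Z of about -1.72, and a two-sided asymptotic p value of about 0.085.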


Example: Blood Pressure Data 


The diastolic blood pressure (mm Hg) was measured on 4 subjects in a treatment group 
and 11 subjects in a control group. Figure 6.1 shows the data displayed in the Data Editor. 
The data consist of two variables—pressure is the diastolic blood pressure of each 
subject, and group indicates whether the subject was in the experimentally treated group 
or the control group. 


Figure 6.1 Diastolic blood pressure of treated and control groups

pressure group
94 Treated 
108 Treated 
110 Treated 
90 Treated 
80 Control 
94 Control 
85 Control 
90 Control 
90 Control 
90 Control 
108 Control 
94 Control 
78 Control 
105 Control 
88 Control 


The Mann-Whitney test is computed for these data. The results are displayed in Figure 6.2. 


Figure 6.2 Mann-Whitney results for diastolic blood pressure data 


Ranks

                            Treatment Group     N    Mean Rank   Sum of Ranks
Diastolic Blood Pressure    Treated             4      11.25        45.00
                            Control            11       6.82        75.00
                            Total              15

Test Statistics¹

                                     Diastolic Blood Pressure
Mann-Whitney U                              9.000
Wilcoxon W                                 75.000
Z                                          -1.720
Asymp. Sig. (2-tailed)                       .085
Exact Sig. [2*(1-tailed Sig.)]               .104
Exact Sig. (2-tailed)                        .099
Exact Sig. (1-tailed)                        .054
Point Probability                            .019

1. Grouping Variable: Treatment Group
2. Not corrected for ties.

The Mann-Whitney statistic for the treated group, calculated by Equation 6.12, is 35.0
and for the control group is 9.0. Thus, the Wilcoxon rank-sum statistic for the control
group is used. The observed Wilcoxon rank-sum statistic is 75. The Mann-Whitney U
statistic is 9.0. The exact one-sided p value, 0.054, is not statistically significant at the
5% level. In this data set, the one-sided asymptotic p value, calculated as one-half of the
two-sided p value, 0.085, is 0.0427. This value does not accurately represent the exact
p value and would lead you to the erroneous conclusion that the treatment group is
significantly different from the control group at the 5% level of significance.
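
For a data set this small, the exact p values can be checked by brute-force enumeration of all $N!/(n_1!\,n_2!) = 1365$ assignments of the pooled mid-ranks to the two groups. The sketch below is only an illustration with hypothetical function names; Exact Tests itself uses much faster algorithms than complete enumeration.

```python
from itertools import combinations


def midranks(values):
    """Mid-ranks of values within the pooled sample, returned in the original order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                               # extend the current tie group
        mid = (i + j + 2) / 2                    # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mid
        i = j + 1
    return ranks


def exact_rank_sum_pvalues(sample1, sample2):
    """Return (t, exact one-sided p, exact two-sided p) for the rank sum of sample1."""
    w = midranks(list(sample1) + list(sample2))
    n1, N = len(sample1), len(w)
    t = sum(w[:n1])
    mean = n1 * (N + 1) / 2
    lower = upper = two_sided = total = 0
    for idx in combinations(range(N), n1):       # every equally likely layout
        T = sum(w[i] for i in idx)
        total += 1
        lower += T <= t
        upper += T >= t
        two_sided += abs(T - mean) >= abs(t - mean)
    return t, min(lower, upper) / total, two_sided / total


treated = [94, 108, 110, 90]
control = [80, 94, 85, 90, 90, 90, 108, 94, 78, 105, 88]
t, p1, p2 = exact_rank_sum_pvalues(treated, control)
print(t, round(p1, 3), round(p2, 3))  # 45.0 0.054 0.099
```

The enumeration uses the treated group as the first sample. Because the two rank sums always add up to N(N + 1)/2, the p values are the same whichever group is chosen.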

Although it is not necessary for this small data set, you can compute the Monte Carlo 
estimate of the exact p value. The results of the Monte Carlo analysis, based on 10,000 
random permutations of the original one-way layout, are displayed in Figure 6.3. 


Figure 6.3 Monte Carlo results for diastolic blood pressure data


Test Statistics¹

                                                           Diastolic Blood Pressure
Mann-Whitney U                                                    9.000
Wilcoxon W                                                       75.000
Z                                                                -1.720
Asymp. Sig. (2-tailed)                                             .085
Exact Sig. [2*(1-tailed Sig.)]                                     .104
Monte Carlo Sig. (2-tailed)    Sig.                                .102
                               99% Confidence Interval
                                 Lower Bound                       .094
                                 Upper Bound                       .110
Monte Carlo Sig. (1-tailed)    Sig.                                .056
                               99% Confidence Interval
                                 Lower Bound                       .050
                                 Upper Bound                       .062

1. Grouping Variable: Treatment Group
2. Not corrected for ties.
3. Based on 10000 sampled tables with starting seed 2000000.


Observe that the Monte Carlo estimate, 0.056, agrees very closely with the exact p value 
of 0.054. Now observe that with 10,000 Monte Carlo samples, the exact p value is 
contained within the limits (0.050, 0.062) with 99% confidence. Since the threshold p 
value, 0.05, falls on the boundary of this interval, it appears that 10,000 Monte Carlo 
samples are insufficient to conclude that the observed result is not statistically 
significant. Accordingly, to confirm the exact results, you can next perform a Monte 
Carlo analysis with 30,000 permutations of the original one-way layout. The results are 
shown in Figure 6.4. This time, the 99% confidence interval is much tighter and does 
indeed confirm with 99% confidence that the exact p value exceeds 0.05.

Figure 6.4 Monte Carlo results with 30,000 samples for diastolic blood pressure data


Test Statistics¹

                                                           Diastolic Blood Pressure
Mann-Whitney U                                                    9.000
Wilcoxon W                                                       75.000
Z                                                                -1.720
Asymp. Sig. (2-tailed)                                             .085
Exact Sig. [2*(1-tailed Sig.)]                                     .104
Monte Carlo Sig. (2-tailed)    Sig.                                .102
                               99% Confidence Interval
                                 Lower Bound                       .098
                                 Upper Bound                       .107
Monte Carlo Sig. (1-tailed)    Sig.                                .056
                               99% Confidence Interval
                                 Lower Bound                       .053
                                 Upper Bound                       .059

1. Grouping Variable: Treatment Group
2. Not corrected for ties.
3. Based on 30000 sampled tables with starting seed 20000000.


Kolmogorov-Smirnov Test 


The Kolmogorov-Smirnov test is applicable in more general settings than the
Mann-Whitney test. Both are tests of the null hypothesis (see Equation 6.2). However, the
Kolmogorov-Smirnov test is a universal test with good power against general
alternatives in which $F_1$ and $F_2$ can differ in both shape and location. The
Mann-Whitney test has good power against location shift alternatives of the form shown in
Equation 6.7.

The Kolmogorov-Smirnov test is a two-sided test having good power against the
alternative hypothesis


$H_1: F_1(v) \ne F_2(v)$, for at least one value of v    Equation 6.23


The Kolmogorov-Smirnov statistics used for testing the hypothesis in Equation 6.23 can
now be defined. These statistics are all functions of the empirical cumulative distribution
function (CDF) for $F_1$ and the empirical CDF for $F_2$. “Statistical Methods” on p. 78
stated that the test statistics in this chapter are all functions of the one-way layout, w,
displayed in Table 6.3, in which the original data have been replaced by appropriate
scores. Indeed, this is true here as well, since you could use the original data as scores
and construct an empirical CDF for each of the two samples of data. In that case, you
would use w = u as the one-way layout of scores. Alternatively, you could first convert
the original data into ranks, just like those for the Mann-Whitney test, and then construct 
an empirical CDF for each of the two samples of ranked data. Hajek (1969) has 
demonstrated that in either case, the same inferences can be made. Thus, the 
Kolmogorov-Smirnov test is classified as a rank test. However, for the purpose of 
actually computing the empirical CDFs and deriving test statistics from them, it is often
more convenient to work directly with raw data instead of first converting them into
ranks (or mid-ranks, in the case of ties). Accordingly, let u be the actually observed
one-way layout of data, depicted in Table 6.2, and let w, the corresponding one-way layout
of scores, also be u. Thus, the entries in Table 6.3 are the original $u_{ij}$'s. Now let
$(u_{[1]1} \le u_{[2]1} \le \cdots \le u_{[n_1]1})$ denote the observations from the first sample sorted in
ascending order, and let $(u_{[1]2} \le u_{[2]2} \le \cdots \le u_{[n_2]2})$ denote the observations from the second
sample, sorted in ascending order. These sorted observations are often referred to as the
order statistics of the sample. The empirical CDF for each distribution is computed from
its order statistics. Before doing this, some additional notation is needed to account for
the possibility of tied observations. Among the $n_j$ order statistics in the jth sample,
j = 1, 2, let there be $g_j \le n_j$ distinct order statistics, with $e_{1j}$ observations all tied for
first place, $e_{2j}$ observations all tied for second place, and so on until finally, $e_{g_j j}$
observations are all tied for last place. Obviously, $e_{1j} + e_{2j} + \cdots + e_{g_j j} = n_j$. Let
$(u^*_{[1]j} < u^*_{[2]j} < \cdots < u^*_{[g_j]j})$ represent the $g_j$ distinct order statistics of sample j = 1, 2.
You can now compute the empirical CDFs, $\hat{F}_1$ for $F_1$ and $\hat{F}_2$ for $F_2$, as shown below.
For j = 1, 2, define


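$\hat{F}_j(u) = 0$ if $u < u^*_{[1]j}$
$\hat{F}_j(u) = (e_{1j} + e_{2j} + \cdots + e_{kj})/n_j$ if $u^*_{[k]j} \le u < u^*_{[k+1]j}$, for $k = 1, 2, \ldots, g_j - 1$
$\hat{F}_j(u) = 1$ if $u \ge u^*_{[g_j]j}$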


The test statistic for testing the null hypothesis (see Equation 6.2) against the two-sided
alternative hypothesis (see Equation 6.23) is the Kolmogorov-Smirnov Z and is defined as


$Z = T\sqrt{n_1 n_2/(n_1 + n_2)}$    Equation 6.24


where T is defined as


$T = \max_v |\hat{F}_1(v) - \hat{F}_2(v)|$    Equation 6.25


and the observed value of T is denoted by t. The exact two-sided p value for testing
Equation 6.2 against Equation 6.23 is


$p_2 = \Pr(T \ge t)$    Equation 6.26

When the exact p value is too difficult to compute, you can resort to Monte Carlo
sampling. The Monte Carlo estimate of $p_2$ is denoted by $\hat{p}_2$. It is computed as shown below:


1. Generate a new one-way layout of scores by permuting the original layout of raw
data, u, in one of the $N!/(n_1!\,n_2!)$ equally likely ways.


2. Compute the value of the test statistic T for the permuted one-way layout.

3. Define the random variable


$z = 1$ if $T \ge t$, and $z = 0$ otherwise    Equation 6.27


Repeat the above steps a total of M times to generate the realizations $(z_1, z_2, \ldots, z_M)$ for
the random variable z. Then an unbiased estimate of $p_2$ is


$\hat{p}_2 = \frac{1}{M}\sum_{l=1}^{M} z_l$    Equation 6.28

Next, let

$\hat{\sigma} = \left[\frac{1}{M-1}\sum_{l=1}^{M}(z_l - \hat{p}_2)^2\right]^{1/2}$    Equation 6.29


be the sample standard deviation of the $z_l$'s. Then a 99% confidence interval for the
exact p value is


$CI = \hat{p}_2 \pm 2.576\,\hat{\sigma}/\sqrt{M}$    Equation 6.30


A technical difficulty arises when either $\hat{p}_2 = 0$ or $\hat{p}_2 = 1$. Now the sample standard
deviation is 0, but the data do not support a confidence interval of zero width. An
alternative way to compute a confidence interval that does not depend on $\hat{\sigma}$ is based on
inverting an exact binomial hypothesis test when an extreme outcome is encountered. It
can be easily shown that if $\hat{p}_2 = 0$, an $\alpha$% confidence interval for the exact p value is


$CI = [0,\ 1 - (1 - \alpha/100)^{1/M}]$    Equation 6.31

Similarly, when $\hat{p}_2 = 1$, an $\alpha$% confidence interval for the exact p value is

$CI = [(1 - \alpha/100)^{1/M},\ 1]$    Equation 6.32

Exact Tests uses default values of M = 10000 and $\alpha$ = 99%. While these defaults can be
easily changed, they provide quick and accurate estimates of exact p values for a wide
range of data sets.

The asymptotic two-sided p value, $\tilde{p}_2$, is based on the following limit theorem:


$\lim_{n_1, n_2 \to \infty} \Pr\left(\sqrt{n_1 n_2/(n_1 + n_2)}\; T \le z\right) = 1 - 2\sum_{i=1}^{\infty} (-1)^{i-1} e^{-2 i^2 z^2}$    Equation 6.33


Although the right side of Equation 6.33 has an infinite number of terms, in practice you
need to compute only the first few terms of the above expression before convergence is
achieved.
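
A minimal sketch of this computation in Python is shown below. It assumes the standard Kolmogorov limiting series, in which the asymptotic p value is one minus the limiting probability, that is, $2\sum_{i \ge 1} (-1)^{i-1} e^{-2 i^2 z^2}$. The function name, the tolerance, and the cap on the number of terms are illustrative choices.

```python
import math


def ks_asymptotic_p(z, tol=1e-12, max_terms=100):
    """Asymptotic two-sided p value for an observed Kolmogorov-Smirnov Z."""
    if z <= 0:
        return 1.0
    total = 0.0
    for i in range(1, max_terms + 1):
        term = (-1) ** (i - 1) * math.exp(-2 * i * i * z * z)
        total += term
        if abs(term) < tol:                      # remaining terms are negligible
            break
    return min(1.0, max(0.0, 2 * total))


# Observed statistic t = 0.6 with n1 = n2 = 10, so Z = 0.6 * sqrt(100 / 20)
z = 0.6 * math.sqrt(10 * 10 / (10 + 10))
print(round(ks_asymptotic_p(z), 3))  # 0.055
```

For moderate and large Z the series converges after two or three terms. For Z close to zero, convergence is slow and the capped sum in this sketch is only approximate.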


Example: Effectiveness of Vitamin C 


These data are taken from Lehmann (1975). The effectiveness of vitamin C in orange
juice and synthetic ascorbic acid was compared in 20 guinea pigs (divided at random
into two groups). Figure 6.5 shows the data displayed in the Data Editor. There are two
variables in these data: score represents the results, in terms of length of odontoblasts
(rounded to the nearest integer) after six weeks; source indicates the source of the
vitamin C, either orange juice or ascorbic acid.


Figure 6.5 Effectiveness of vitamin C in orange juice and ascorbic acid

score source
 8 Orange Juice
 8 Orange Juice
10 Orange Juice
10 Orange Juice
10 Orange Juice
15 Orange Juice
15 Orange Juice
16 Orange Juice
18 Orange Juice
22 Orange Juice
 4 Ascorbic Acid
 5 Ascorbic Acid
 6 Ascorbic Acid
 6 Ascorbic Acid
 7 Ascorbic Acid
 7 Ascorbic Acid
10 Ascorbic Acid
11 Ascorbic Acid
11 Ascorbic Acid
12 Ascorbic Acid


The results of the two-sample Kolmogorov-Smirnov test for these data are shown in
Figure 6.6.


Figure 6.6 Two-sample Kolmogorov-Smirnov results for orange juice and ascorbic acid
data


Frequencies

         Source of Vitamin C     N
Score    Orange Juice           10
         Ascorbic Acid          10
         Total                  20


Test Statistics¹

                                          Score
Most Extreme Differences    Absolute       .600
                            Positive       .000
                            Negative      -.600
Kolmogorov-Smirnov Z                      1.342
Asymp. Sig. (2-tailed)                     .055
Exact Significance (2-tailed)              .045
Point Probability                          .043


1. Grouping Variable: Source of Vitamin C


The exact two-sided p value is 0.045. This demonstrates that, despite the small sample 
size, there is a statistically significant difference between the two forms of vitamin C 
administration. The corresponding asymptotic p value equals 0.055, which is not 
statistically significant. It has been demonstrated in several independent studies (see, for 
example, Goodman, 1954) that the asymptotic result is conservative. This is borne out 
in the present example. 
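
The observed statistic reported in Figure 6.6 can be reproduced directly from the empirical CDFs. The following Python sketch is an illustration only; the function name is hypothetical, and the differences are taken as the first sample minus the second.

```python
import math


def ks_two_sample(sample1, sample2):
    """Return (T, most positive difference, most negative difference, Z)."""
    n1, n2 = len(sample1), len(sample2)
    most_pos = most_neg = 0.0
    # The empirical CDFs change only at observed values, so checking those is enough
    for v in sorted(set(sample1) | set(sample2)):
        f1 = sum(x <= v for x in sample1) / n1   # empirical CDF of sample 1 at v
        f2 = sum(x <= v for x in sample2) / n2   # empirical CDF of sample 2 at v
        most_pos = max(most_pos, f1 - f2)
        most_neg = min(most_neg, f1 - f2)
    t = max(most_pos, -most_neg)
    z = t * math.sqrt(n1 * n2 / (n1 + n2))
    return t, most_pos, most_neg, z


orange = [8, 8, 10, 10, 10, 15, 15, 16, 18, 22]
ascorbic = [4, 5, 6, 6, 7, 7, 10, 11, 11, 12]
t, pos, neg, z = ks_two_sample(orange, ascorbic)
print(round(t, 3), round(pos, 3), round(neg, 3), round(z, 3))
```

The largest gap occurs at a score of 7, where none of the orange juice observations but six of the ten ascorbic acid observations have been reached, giving T = 0.6 and Z of about 1.342.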


Wald-Wolfowitz Runs Test 


The Wald-Wolfowitz runs test is a competitor to the Kolmogorov-Smirnov test for 
testing the null hypothesis 


$H_0: F_1(v) = F_2(v)$ for all v    Equation 6.34

against the alternative hypothesis


$H_1: F_1(v) \ne F_2(v)$ for at least one v    Equation 6.35


The test is completely general, in the sense that no distributional assumptions need to be
made about $F_1$ and $F_2$. Thus, it is referred to as an omnibus, or distribution-free, test.

Suppose the data consist of the one-way layout displayed as Table 6.2. The Wald- 
Wolfowitz test statistic is computed in the following steps: 


1. Sort all $N = n_1 + n_2$ observations in ascending order, and position them in a single
row represented as $(a_{[1]} \le a_{[2]} \le \cdots \le a_{[N]})$.


2. Replace each observation in the above row with the sample identifier 1 if it came 
from the first sample and 2 if it came from the second sample. 


3. A run is defined as a succession of identical numbers that are followed and preceded 
by a different number or no number at all. The test statistic, T, for the Wald-Wolfowitz 
test is the number of runs in the above row of 1’s and 2’s. 


Under the null hypothesis, you expect the sorted list of observations to be well mixed
with respect to the sample 1 and sample 2 identifiers. In that case, you will see a large
number of runs. On the other hand, if observations from $F_1$ tend to be smaller than those
from $F_2$, you expect the sorted list to lead with the sample 1 observations and be
followed by the sample 2 observations. In the extreme case, there will be only two runs.
Likewise, if the observations from $F_2$ tend to be smaller than those from $F_1$, you expect
the sorted list to lead with the sample 2 observations and be followed by the sample 1
observations. Again, in the extreme case, there will be only two runs. These
considerations imply that the p value for testing $H_0$ against the omnibus alternative $H_1$
should be the left tail of the random variable, T, at the observed number of runs, t. That
is, the exact p value is given by


$p_1 = \Pr(T \le t)$    Equation 6.36


The distribution of T is obtained by permuting the observed one-way layout in all 
possible ways and assigning the probability (see Equation 6.4) to each permutation. You 
can also derive this distribution theoretically using the same reasoning that was used in
“Runs Test” on p. 53 in Chapter 4; the Monte Carlo p value, $\hat{p}_1$, and the asymptotic p
value, $\tilde{p}_1$, can be obtained similarly, using the results described in this section.


Example: Discrimination against Female Clerical Workers 


The following example uses a subset of data published by Gastwirth (1991). In 
November, 1983, a female employee of Shelby County Criminal Court filed a charge of 
discrimination in pay between similarly qualified male and female clerical workers.
Figure 6.7 shows the data displayed in the Data Editor. Salary represents the starting
salaries of nine court employees hired between 1975 and 1979, and gender indicates the
gender of the employee.


Figure 6.7 Starting monthly salaries (in dollars) of nine court clerical workers 


salary gender
525 Female
500 Female
550 Female
576 Female
458 Female
600 Female
700 Male
886 Male
600 Male


A quick visual inspection of these data reveals that in no case was a female paid a higher 
starting salary than a male hired for a comparable position. Consider these data to clarify 
how the Wald-Wolfowitz statistic is obtained. 

The table below consists of two rows. The first row contains the nine observations
sorted in ascending order. The second row contains the sample identifier for each
observation: 1 if female and 2 if male.


458 500 525 550 576 600 600 700 886
  1   1   1   1   1   1   2   2   2


By the above definition, there are only two runs in these data. Notice, however, that there
is a tie in the data. One observation from the first sample and one from the second
sample are both tied with a value of 600. Therefore, you could also represent the succession
of observations and their sample identifiers as shown below.


458 500 525 550 576 600 600 700 886
  1   1   1   1   1   2   1   2   2


Now there are four runs in the above succession of sample identifiers. First, there is a
run of five 1’s. Then a run of a single 2, followed by a run of a single 1. Finally, there is
a run of two 2’s.

The liberal value of the Wald-Wolfowitz test statistic is the one yielding the smallest
number of runs after rearranging the ties in all possible ways. This is denoted by $t_{\min}$.
The conservative value of the Wald-Wolfowitz test statistic is the one yielding the largest
number of runs after rearranging the ties in all possible ways. This is denoted by $t_{\max}$.
Exact Tests produces two p values,


$p_{1,\min} = \Pr(T \le t_{\min})$    Equation 6.37


and


$p_{1,\max} = \Pr(T \le t_{\max})$    Equation 6.38


Conservative decisions are usually made with $p_{1,\max}$. For the clerical workers data set,
the output of the Wald-Wolfowitz test is shown in Figure 6.8.


Figure 6.8 Wald-Wolfowitz runs test for clerical workers data 


Frequencies

                           Gender of Worker    N
Starting Monthly Salary    Male                3
                           Female              6
                           Total               9

Test Statistics¹

                                               Number              Exact Sig.    Point
                                               of Runs     Z       (1-tailed)    Probability
Starting Monthly Salary    Minimum Possible       2      -2.041      .024          .024
                           Maximum Possible       4       -.408      .345          .238


1. Wald-Wolfowitz Test
2. Grouping Variable: Gender of Worker
3. There are inter-group ties involving 2 cases.

When ties are broken in all possible ways, the minimum number of runs is 2, and the
maximum is 4. The smallest possible exact p value is thus $p_{1,\min} = 0.024$. The largest
possible exact p value is $p_{1,\max} = 0.345$. In the interest of being as conservative as
possible, this is clearly the one to report. It implies that you cannot reject the null
hypothesis that $F_1 = F_2$.
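
The two p values for the clerical workers data can be verified by enumerating all 84 equally likely arrangements of the six female and three male identifiers. The Python sketch below is a brute-force illustration with hypothetical function names, not the algorithm used by Exact Tests. It breaks ties in every possible way to find the minimum and maximum number of runs.

```python
from itertools import combinations, groupby, permutations, product


def count_runs(labels):
    """Number of runs of identical identifiers in a sequence."""
    return sum(1 for _ in groupby(labels))


def runs_range_with_ties(values, labels):
    """Smallest and largest number of runs over all orderings of tied observations."""
    pairs = sorted(zip(values, labels))
    blocks = []
    for _, group in groupby(pairs, key=lambda p: p[0]):
        tied_labels = [label for _, label in group]
        blocks.append(sorted(set(permutations(tied_labels))))  # orderings within a tie
    counts = [count_runs([label for block in choice for label in block])
              for choice in product(*blocks)]
    return min(counts), max(counts)


def exact_runs_pvalue(n1, n2, t):
    """Pr(T <= t) when all arrangements of the identifiers are equally likely."""
    N = n1 + n2
    hits = total = 0
    for positions in combinations(range(N), n1):
        chosen = set(positions)
        labels = [1 if i in chosen else 2 for i in range(N)]
        total += 1
        hits += count_runs(labels) <= t
    return hits / total


salary = [525, 500, 550, 576, 458, 600, 700, 886, 600]
gender = [1, 1, 1, 1, 1, 1, 2, 2, 2]             # 1 = female, 2 = male
t_min, t_max = runs_range_with_ties(salary, gender)
print(t_min, t_max)                               # 2 4
print(round(exact_runs_pvalue(6, 3, t_min), 3))   # 0.024
print(round(exact_runs_pvalue(6, 3, t_max), 3))   # 0.345
```

The number of tie orderings grows quickly with the size of the tie groups, so this enumeration is practical only for small data sets such as this one.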


Median Test 


The two-sample version of the median test is identical in every respect to the K-sample
version discussed in Chapter 8. Please refer to the discussion of the median test in
Chapter 8 and substitute K = 2 if there are only two samples.


K-Sample Inference: Related Samples


This chapter discusses tests based on K related samples, each of size N. It is a
generalization of the paired-sample problem described in Chapter 5. The data consist of
N independent K x 1 vectors or blocks of observations in which there is dependence
among the K components of each block. The dependence can arise in various ways. Here
are a few examples:

• There are K repeated measurements on each of N subjects, possibly at different time
points, once after each of K treatments has been applied to the subject.

• There are K subjects within each of N independent matched sets of data, where the
matching is based on demographic, social, medical or other factors that are a priori
known to influence response and are not, therefore, under investigation.

• There are K distinct judges, all evaluating the same set of N applicants and assigning
ordinal scores to them.


Many other possibilities exist for generating K related samples of data. In all of these 
settings, the objective is to determine if the K populations from which the data arose 
are the same. Tests of this hypothesis are often referred to as blocked comparisons to 
emphasize that the data consist of N independent blocks with K dependent observations 
within each block. Exact Tests provides three tests for this problem: Friedman’s,
Cochran’s Q, and Kendall’s W, also known as Kendall’s coefficient of concordance.


Available Tests 


Table 7.1 shows the available tests for related samples, the procedure from which they 
can be obtained, and a bibliographical reference for each test.

Table 7.1 Available tests

Test                 Procedure                                                  Reference
Friedman’s test      Nonparametric Tests: Tests for Several Related Samples    Lehmann (1975)
Kendall’s W test     Nonparametric Tests: Tests for Several Related Samples    Conover (1975)
Cochran’s Q test     Nonparametric Tests: Tests for Several Related Samples    Lehmann (1975)


When to Use Each Test


Friedman’s test. Use this test to compare K related samples of data. Each observation
consists of a 1 x K vector of correlated values, and there are N such observations, thus
forming an N x K two-way layout.

Kendall’s W test. This test is completely equivalent to Friedman’s test. The only
advantage of this test over Friedman’s is that Kendall’s W has an interpretation as the
coefficient of concordance, a popular measure of association. (See also Chapter 14.)

Cochran’s Q test. This test is identical to Friedman’s test but is applicable only to the
special case where the responses are all binary.


Statistical Methods


The observed data for all of the tests in this chapter are represented in the form of a
two-way layout, shown in Table 7.2.


Table 7.2 Two-way layout for K related samples

Block    Treatments
Id       1           2           ...    K
1        $u_{11}$    $u_{12}$    ...    $u_{1K}$
2        $u_{21}$    $u_{22}$    ...    $u_{2K}$
...
N        $u_{N1}$    $u_{N2}$    ...    $u_{NK}$

This layout consists of N independent blocks of data with K correlated observations within
each block. The data are usually continuous (possibly with ties). However, for the
Cochran’s Q test, the data are binary. Various test statistics can be defined on this two-way
layout. Usually, however, these test statistics are defined on ranked data rather than on the
original raw data. Accordingly, first replace the K observations, $(u_{i1}, u_{i2}, \ldots, u_{iK})$, in block
i with corresponding ranks, $(r_{i1}, r_{i2}, \ldots, r_{iK})$. If there were no ties among these $u_{ij}$'s, you
would assign the first K integers (1, 2, ..., K), not necessarily in order, as the ranks of
these K observations. If there are ties, you would assign the average rank or mid-rank to
the tied observations. Specifically, suppose that the K observations of the first block take
on $e_1$ distinct values, with $d_{11}$ of the observations being equal to the smallest value, $d_{12}$
to the next smallest, $d_{13}$ to the third smallest, and so on. Similarly, the K observations in
the second block take on $e_2$ distinct values, with $d_{21}$ of the observations being equal to
the smallest value, $d_{22}$ to the next smallest, $d_{23}$ to the third smallest, and so on. Finally,
the K observations in the Nth block take on $e_N$ distinct values, with $d_{N1}$ of the
observations being equal to the smallest value, $d_{N2}$ to the next smallest, $d_{N3}$ to the third
smallest, and so on. It is now possible to define the mid-ranks precisely. For
i = 1, 2, ..., N, the $e_i$ distinct mid-ranks in the ith block, sorted in ascending order, are


$r^*_{i,1} = (d_{i1} + 1)/2$
$r^*_{i,2} = d_{i1} + (d_{i2} + 1)/2$
...
$r^*_{i,e_i} = d_{i1} + d_{i2} + \cdots + (d_{i,e_i} + 1)/2$    Equation 7.1


You can now replace the original observations, $(u_{i1}, u_{i2}, \ldots, u_{iK})$, in the ith block with
corresponding mid-ranks, $(r_{i1}, r_{i2}, \ldots, r_{iK})$, where each $r_{ij}$ is the appropriate selection
from the set of distinct mid-ranks $(r^*_{i,1} < r^*_{i,2} < \cdots < r^*_{i,e_i})$. The modified two-way
layout is shown in Table 7.3.


Table 7.3 Two-way layout for mid-ranks for K related samples


Block    Treatments
Id       1           2           ...    K
1        $r_{11}$    $r_{12}$    ...    $r_{1K}$
2        $r_{21}$    $r_{22}$    ...    $r_{2K}$
...
N        $r_{N1}$    $r_{N2}$    ...    $r_{NK}$

As an example, suppose that K = 5, there are two blocks, and the two-way layout of the
raw data (the $u_{ij}$'s) is as shown in Table 7.4.


Table 7.4 Two-way layout with two blocks of raw data 


Block Treatments 
ID 1 2 3 4 5 


1 1.3 1.1 1.1 1.6 1.1 
2 1.9 1.7 1.9 1.9 1.7 


For the first block, $e_1 = 3$, with $d_{11} = 3$, $d_{12} = 1$, $d_{13} = 1$. Using Equation 7.1, you
can obtain mid-ranks $r^*_{1,1} = 2$, $r^*_{1,2} = 4$, and $r^*_{1,3} = 5$. For the second block,
$e_2 = 2$, with $d_{21} = 2$, $d_{22} = 3$. Thus, you obtain mid-ranks $r^*_{2,1} = 1.5$ and
$r^*_{2,2} = 4$. You can now use these mid-ranks to replace the original $u_{ij}$ values with
corresponding $r_{ij}$ values. The modified two-way layout, in which raw data have been
replaced by mid-ranks, is displayed as Table 7.5.


Table 7.5 Sample two-way layout with raw data replaced by mid-ranks 


Block Treatments 
ID 1 2 3 4 5 
1 4 2 2 5 2 


2 4 1.5 4 4 1.5 
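
The conversion from Table 7.4 to Table 7.5 can be reproduced with a few lines of Python. The function name `block_midranks` is an illustrative choice; it applies Equation 7.1 to one block at a time.

```python
def block_midranks(block):
    """Replace the K observations of one block by their within-block mid-ranks."""
    distinct = sorted(set(block))
    counts = [block.count(v) for v in distinct]  # d_i1, d_i2, ... for this block
    rank_of, below = {}, 0
    for value, d in zip(distinct, counts):
        rank_of[value] = below + (d + 1) / 2     # Equation 7.1
        below += d
    return [rank_of[v] for v in block]


raw = [[1.3, 1.1, 1.1, 1.6, 1.1],
       [1.9, 1.7, 1.9, 1.9, 1.7]]
for block in raw:
    print(block_midranks(block))
# [4.0, 2.0, 2.0, 5.0, 2.0]
# [4.0, 1.5, 4.0, 4.0, 1.5]
```

Ranking is carried out separately within each block, so observations are never compared across blocks.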


All of the tests discussed in this chapter are based on test statistics that are functions of 
the two-way layout of mid-ranks displayed in Table 7.3. Before specifying these test 
statistics, define the rank-sum for any treatment j as 


$r_{.j} = \sum_{i=1}^{N} r_{ij}$    Equation 7.2


the average rank-sum for treatment j as


$\bar{r}_{.j} = r_{.j}/N$    Equation 7.3

and the average rank-sum across all treatments as 


Equation 7.4

The test statistics for Friedman’s, Kendall’s W, and Cochran’s Q tests, respectively, are
all functions of $r_{ij}$, $r_{.j}$, and $r_{..}$. The functional form for each test differs, and is defined
later in this chapter in the specific section that deals with the test. However, regardless
of its functional form, the exact probability distribution of each test statistic is obtained
by the same permutation argument. This argument and the corresponding definitions of
the one- and two-sided p values are given below.

Let T denote the test statistic for any of the tests in this chapter, and test the null
hypothesis


$H_0$: There is no difference in the K treatments    Equation 7.5


If $H_0$ is true, the K mid-ranks, $(r_{i1}, r_{i2}, \ldots, r_{iK})$, belonging to block i could have been
obtained in any order. That is, any treatment could have produced any mid-rank, and
there are K! equally likely ways to assign the K mid-ranks to the K treatments. If you
apply the same permutation argument to each of the N blocks, there are $(K!)^N$ equally
likely ways to permute the observed mid-ranks such that the permutations are only
carried out within each block but never across the different blocks. That is, there are
$(K!)^N$ equally likely permutations of the original two-way layout of mid-ranks, where
only intra-block permutations are allowed. Each of these permutations thus has a
$(K!)^{-N}$ probability of being realized and leads to a specific value of the test statistic. The
exact probability distribution of T can be evaluated by enumerating all of the
permutations of the original two-way layout of mid-ranks. If t denotes the observed
value of T in the original two-way layout, then


$\Pr(T = t) = \sum_{T = t} (K!)^{-N}$    Equation 7.6

the sum being taken over all possible permutations of the original two-way layout of 

mid-ranks which are such that T = t. The probability distribution (see Equation 7.6) and 


its tail areas are obtained in Exact Tests by fast numerical algorithms. The exact two- 
sided p value is defined as 


p, = Pr(T22) = yy” Equation 7.7 


T2t 
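The enumeration behind Equation 7.6 and Equation 7.7 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm used by Exact Tests: the function names, the example layout, and the choice of statistic (the sum of squared deviations of the treatment rank-sums from their mean) are all hypothetical. Brute-force enumeration is practical only for very small layouts.

```python
from fractions import Fraction
from itertools import permutations, product


def rank_sum_statistic(layout):
    """Illustrative statistic: squared deviations of treatment rank-sums from their mean."""
    k = len(layout[0])
    sums = [sum(Fraction(block[j]) for block in layout) for j in range(k)]
    mean = sum(sums) / k
    return sum((s - mean) ** 2 for s in sums)


def exact_p_value(layout, statistic=rank_sum_statistic):
    """Return Pr(T >= t) by enumerating within-block rearrangements only."""
    t_obs = statistic(layout)
    # Distinct orderings of each block. With ties there are fewer than K! of them,
    # but they remain equally likely, so the resulting probability is unchanged.
    per_block = [sorted(set(permutations(block))) for block in layout]
    total = hits = 0
    for candidate in product(*per_block):
        total += 1
        if statistic(candidate) >= t_obs:
            hits += 1
    return Fraction(hits, total)


# Hypothetical layout of mid-ranks: N = 3 blocks (rows), K = 3 treatments (columns).
midranks = [(1, 2, 3), (1, 2, 3), (1, 3, 2)]
p2 = exact_p_value(midranks)
print(p2, float(p2))
```

Exact fractions are used so that permutations whose statistic equals the observed value are counted reliably, without floating-point comparison problems.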


When Equation 7.7 is too difficult to obtain by exact methods, it can be estimated by 
Monte Carlo sampling, as shown in the following steps: 


1. Generate a new two-way layout of mid-ranks by permuting each of the N blocks of 
the original two-way layout of mid-ranks (see Table 7.3) in one of K! equally likely 
ways. 



2. Compute the value of the test statistic $T$ for the new two-way layout. Define the
random variable

$$Z = \begin{cases} 1 & \text{if } T \ge t \\ 0 & \text{otherwise} \end{cases}$$    Equation 7.8


3. Repeat steps 1 and 2 a total of $M$ times to generate the realizations
$(z_1, z_2, \ldots, z_M)$ for the random variable $Z$. Then an unbiased estimate of $p_2$ is

$$\hat{p}_2 = \frac{\sum_{l=1}^{M} z_l}{M}$$    Equation 7.9

Next, let

$$\hat{\sigma} = \left[ \frac{1}{M-1} \sum_{l=1}^{M} (z_l - \hat{p}_2)^2 \right]^{1/2}$$    Equation 7.10

be the sample standard deviation of the $z_l$’s. Then a 99% confidence interval for the
exact p value is:

$$CI = \hat{p}_2 \pm 2.576\, \hat{\sigma} / \sqrt{M}$$    Equation 7.11


A technical difficulty arises when either $\hat{p}_2 = 0$ or $\hat{p}_2 = 1$. The sample
standard deviation is then 0, but the data do not support a confidence interval of zero
width. An alternative way to compute a confidence interval that does not depend on
$\hat{\sigma}$ is based on inverting an exact binomial hypothesis test when an extreme
outcome is encountered. It can be easily shown that if $\hat{p}_2 = 0$, an $\alpha$%
confidence interval for the exact p value is

$$CI = [0,\; 1 - (1 - \alpha/100)^{1/M}]$$    Equation 7.12

Similarly, when $\hat{p}_2 = 1$, an $\alpha$% confidence interval for the exact p value is

$$CI = [(1 - \alpha/100)^{1/M},\; 1]$$    Equation 7.13

Exact Tests uses default values of $M = 10000$ and $\alpha = 99\%$. While these defaults can
be easily changed, they provide quick and accurate estimates of exact p values for a wide
range of data sets.
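The three Monte Carlo steps and the confidence intervals of Equation 7.11 through Equation 7.13 can be sketched as follows. This is a minimal illustration, not the Exact Tests implementation: the function names, the example layout, and the statistic are hypothetical. The seed value merely mirrors the starting seed reported in the figures of this chapter, and because Python's random number generator differs from the one used by the product, the output will not reproduce those figures.

```python
import math
import random
from fractions import Fraction
from statistics import NormalDist


def rank_sum_statistic(layout):
    """Illustrative statistic: squared deviations of treatment rank-sums from their mean."""
    k = len(layout[0])
    sums = [sum(Fraction(block[j]) for block in layout) for j in range(k)]
    mean = sum(sums) / k
    return sum((s - mean) ** 2 for s in sums)


def monte_carlo_p_value(layout, statistic, m=10000, level=99.0, seed=2000000):
    """Estimate Pr(T >= t) from m random within-block permutations (steps 1 to 3)."""
    rng = random.Random(seed)
    t_obs = statistic(layout)
    hits = 0
    for _ in range(m):
        # Step 1: permute each block independently in one of its equally likely ways.
        shuffled = [rng.sample(list(block), len(block)) for block in layout]
        # Step 2: z = 1 if the permuted statistic is at least the observed one.
        if statistic(shuffled) >= t_obs:
            hits += 1
    p_hat = hits / m                                    # Equation 7.9
    tail = 1.0 - level / 100.0
    if hits == 0:                                       # Equation 7.12
        return p_hat, (0.0, 1.0 - tail ** (1.0 / m))
    if hits == m:                                       # Equation 7.13
        return p_hat, (tail ** (1.0 / m), 1.0)
    # Equation 7.10: every z is 0 or 1, so the sum of squares has a closed form.
    ss = hits * (1.0 - p_hat) ** 2 + (m - hits) * p_hat ** 2
    sigma = math.sqrt(ss / (m - 1))
    z = NormalDist().inv_cdf(0.5 + level / 200.0)       # about 2.576 for 99%
    half = z * sigma / math.sqrt(m)                     # Equation 7.11
    return p_hat, (max(0.0, p_hat - half), min(1.0, p_hat + half))


# Hypothetical layout of mid-ranks: N = 3 blocks, K = 3 treatments.
midranks = [(1, 2, 3), (1, 2, 3), (1, 3, 2)]
estimate, interval = monte_carlo_p_value(midranks, rank_sum_statistic)
print(estimate, interval)
```

For this small layout the exact p value can be enumerated directly, so the estimate can be compared with it. For larger layouts only the Monte Carlo estimate and its interval are practical.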



The asymptotic p value is obtained by noting that the large-sample distribution of $T$ is
chi-square with $K - 1$ degrees of freedom. Thus, the asymptotic two-sided p value is

$$\tilde{p}_2 = \Pr(\chi^2_{K-1} \ge t)$$    Equation 7.14

One-sided p values are inappropriate for the tests in this chapter, since they all assume
that there is no a priori natural ordering of the $K$ treatments under the alternative
hypothesis. Thus, large observed values of $T$ are indicative of a departure from $H_0$ but
not of the direction of the departure.
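Equation 7.14 is an upper-tail chi-square probability. The sketch below evaluates it with the Python standard library only, using the finite-series forms of the chi-square tail for integer degrees of freedom. The function name `chi2_sf` is hypothetical, and in practice a statistics library would normally be used instead.

```python
import math


def chi2_sf(x, df):
    """Upper tail Pr(chi-square with df degrees of freedom >= x), integer df >= 1."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-z) * sum_{k=0}^{df/2 - 1} z^k / k!
        term = total = 1.0
        for k in range(1, df // 2):
            term *= z / k
            total += term
        return math.exp(-z) * total
    # Odd df: erfc(sqrt(z)) + exp(-z) * sum_{k=0}^{(df-3)/2} z^(k + 1/2) / Gamma(k + 3/2)
    total = 0.0
    term = 2.0 * math.sqrt(z / math.pi)      # k = 0 term, since Gamma(3/2) = sqrt(pi)/2
    for k in range((df - 1) // 2):
        total += term
        term *= z / (k + 1.5)
    return math.erfc(math.sqrt(z)) + math.exp(-z) * total


# Asymptotic p values for two statistics reported later in this chapter.
print(round(chi2_sf(9.153, 4), 3))     # Friedman's test, hypnosis data (K = 5)
print(round(chi2_sf(13.778, 7), 3))    # Friedman's test, meeting data (K = 8)
```

Both values agree with the Asymp. Sig. entries shown in the corresponding figures (.057 and .055).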


Friedman’s Test 


The methods discussed in this and succeeding sections of this chapter apply to both the
randomization and population models for generating the data. If you assume that the
assignment of the treatments to the $K$ subjects within each block is random (the
randomized block design), you need make no further assumptions concerning any
particular population model for generating the $u_{ij}$’s. This is the approach taken by
Lehmann (1975). However, sometimes it is useful to specify a population model, since
it allows you to define the null and alternative hypotheses precisely. Accordingly,
following Hollander and Wolfe (1973), you can take the model generating the original
two-way layout (see Table 7.2) to be

$$U_{ij} = \mu + \beta_i + \tau_j + \varepsilon_{ij}$$    Equation 7.15

for $i = 1, 2, \ldots, N$ and $j = 1, 2, \ldots, K$, where $\mu$ is the overall mean,
$\beta_i$ is the block effect, $\tau_j$ is the treatment effect, and the
$\varepsilon_{ij}$’s are identically distributed unobservable error terms from an unknown
distribution, with a mean of 0. All of these parameters are unknown, but for
identifiability you can assume that

$$\sum_{i=1}^{N} \beta_i = \sum_{j=1}^{K} \tau_j = 0$$

Note that $U_{ij}$ is a random variable, whereas $u_{ij}$ is the specific value assumed by it
in the data set under consideration. The null hypothesis that there is no treatment effect
may be formally stated as

$$H_0\colon \tau_1 = \tau_2 = \cdots = \tau_K$$    Equation 7.16
Friedman’s test has good power against the alternative hypothesis

$$H_1\colon \tau_{j_1} \ne \tau_{j_2} \text{ for at least one } (j_1, j_2) \text{ pair}$$    Equation 7.17

Notice that this alternative hypothesis is an omnibus one. It does not specify any ordering 
of the treatments in terms of increases in response levels. The alternative to the null 
hypothesis is simply that the treatments are different, not that one specific treatment is 
more effective than another. 

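Friedman’s test uses the following test statistic, defined on the two-way layout of
mid-ranks shown in Table 7.3.

$$T_F = \frac{12 \sum_{j=1}^{K} (r_j - N r_{..})^2}{NK(K+1) - (K-1)^{-1} \sum_{i=1}^{N} \sum_{j=1}^{e_i} (d_{ij}^3 - d_{ij})}$$    Equation 7.18

where $r_j$ is the rank-sum for treatment $j$, $r_{..}$ is given by Equation 7.4 and equals
$(K+1)/2$, and the double sum in the denominator is the correction for ties, with $e_i$ the
number of sets of tied observations in block $i$ and $d_{ij}$ the number of observations in
the $j$th such set. In the absence of ties the correction vanishes and the denominator is
$NK(K+1)$.

The exact, Monte Carlo, and asymptotic two-sided p values based on this statistic are
obtained by Equation 7.7, Equation 7.9, and Equation 7.14, respectively.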
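A minimal sketch of the statistic in Equation 7.18, computed from raw observations, is given below. The function names and the data are hypothetical. The data are not the values shown in Figure 7.1; they were constructed only so that their mean ranks match those reported in Figure 7.2, with a single pair of tied observations.

```python
def midranks(values):
    """Rank the values of one block from 1 to K, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average = (pos + end) / 2.0 + 1.0       # mean of ranks pos+1 .. end+1
        for i in order[pos:end + 1]:
            ranks[i] = average
        pos = end + 1
    return ranks


def friedman_statistic(data):
    """Tie-corrected Friedman statistic for an N x K layout of raw observations."""
    n, k = len(data), len(data[0])
    ranked = [midranks(row) for row in data]
    rank_sums = [sum(row[j] for row in ranked) for j in range(k)]
    numerator = 12.0 * sum((r - n * (k + 1) / 2.0) ** 2 for r in rank_sums)
    # Tie correction: sum of d^3 - d over every set of d tied values in each block.
    tie_term = sum(row.count(v) ** 3 - row.count(v) for row in data for v in set(row))
    return numerator / (n * k * (k + 1) - tie_term / (k - 1))


# Hypothetical data: 3 subjects (rows) by 5 emotions (fear, happy, depress, calm, agitate).
data = [
    (20.0, 30.0, 15.0, 10.0, 15.0),
    (12.0, 30.0, 10.0, 14.0, 16.0),
    (13.0, 30.0, 10.0, 12.0, 16.0),
]
t_f = friedman_statistic(data)
print(round(t_f, 3))
```

With these constructed data the rank-sums are 9, 15, 4.5, 6, and 10.5, and the statistic rounds to 9.153, the Chi-Square value reported in Figure 7.2.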


Example: Effect of Hypnosis on Skin Potential 


This example is based on an actual study (Lehmann, 1975). However, the original data 
have been altered to illustrate the importance of exact inference for data characterized 
by a small number of blocks but a large block size. In this study, hypnosis was used to 
elicit (in a random order) the emotions of fear, happiness, depression, calmness, and 
agitation from each of three subjects. Figure 7.1 shows these data displayed in the Data 
Editor. Subject identifies the subject, and fear, happy, depress, calmness, and agitate give 
the subject’s skin measurements (adjusted for initial level) in millivolts for each of the
emotions studied. 


Figure 7.1 Effect of hypnosis on skin potential 


Do the five types of hypnotic treatments result in different skin measurements? The data
seem to suggest that this is the case, but there were only three subjects in the sample. 
Friedman’s test can be used to test this hypothesis accurately. The results are displayed 
in Figure 7.2. 


Figure 7.2 Friedman’s test results for hypnosis data 


Ranks

               Mean Rank
FEAR             3.00
Happiness        5.00
Depression       1.50
Calmness         2.00
Agitation        3.50


Test Statistics¹

N                3
Chi-Square       9.153
df               4
Asymp. Sig.      .057
Exact Sig.       .027


1. Friedman Test


The exact two-sided p value is 0.027 and suggests that the five types of hypnosis are
significantly different in their effects on skin potential. The asymptotic two-sided p value,
0.057, is about double the exact two-sided p value and does not show statistical
significance at the 5% level.

Because this data set is small, the exact computations can be executed quickly. For a
larger data set, the Monte Carlo estimate of the exact p value is useful. Figure 7.3
displays the results of a Monte Carlo analysis on the same data set, based on generating
10,000 permutations of the original two-way layout.
Figure 7.3 Monte Carlo results for hypnosis data


Ranks

               Mean Rank
FEAR             3.00
Happiness        5.00
Depression       1.50
Calmness         2.00
Agitation        3.50


Test Statistics¹

                                        Monte Carlo Sig.
                                               99% Confidence Interval
N    Chi-Square    df    Asymp. Sig.    Sig.    Lower Bound    Upper Bound
3    9.153         4     .057           .027    .023           .032


1. Friedman Test


Notice that the Monte Carlo point estimate of 0.027 is much closer to the true p value
than the asymptotic p value. In addition, the Monte Carlo technique guarantees with
99% confidence that the true p value is contained within the range (0.023, 0.032). This
confirms the results of the exact inference, that the differences in the five modes of
hypnosis are statistically significant. The asymptotic analysis failed to demonstrate this
result.


Kendall’s W 


Kendall’s W, or coefficient of concordance, was actually developed as a measure of 
association, with the N blocks representing N independent judges, each one assigning 
ranks to the same set of K applicants (Kendall and Babington-Smith, 1939). Kendall’s 
W measures the extent to which the N judges agree on their rankings of the K applicants. 

Kendall’s W bears a close relationship to Friedman’s test; Kendall’s W is in fact a
scaled version of Friedman’s test statistic:

$$W = \frac{T_F}{N(K-1)}$$    Equation 7.19

The exact permutation distribution of $W$ is the same as that of $T_F$ up to the constant
scale factor $N(K-1)$, and tests based on either $W$ or $T_F$ produce identical p values.
The scaling ensures that $W = 1$ if there is perfect agreement among the $N$ judges in terms
of how they rank the $K$ applicants. On the other hand, if there is perfect disagreement
among the $N$ judges, $W = 0$. The fact that the judges don’t agree implies that they don’t
rank the $K$ applicants in the same order. So each applicant will fare well at the hands of
some judges and poorly at the hands of others. Under perfect disagreement, each applicant
will fare the same overall and will thereby produce an identical value for $r_{.j}$. This
common value of $r_{.j}$ will be $r_{..}$, and as a consequence, $W = 0$.


Example: Attendance at an Annual Meeting 


This example is taken from Siegel and Castellan (1988). The Society for Cross-Cultural
Research (SCCR) decided to conduct a survey of its membership on factors influencing
attendance at its annual meeting. A sample of the membership was asked to rank eight
factors that might influence attendance. The factors, or variables, were airfare, climate,
season, people, program, publicity, present, and interest. Figure 7.4 displays the data in the
Data Editor and shows how three members (raters 4, 21, and 11) ranked the eight
variables.


Figure 7.4 Rating of factors affecting decision to attend meeting 




The null hypothesis is that Kendall’s coefficient of concordance is 0, that is, that each
rater (judge) assigns the eight possible ranks to the factors (applicants) at random. The
results of testing this hypothesis are shown in Figure 7.5.
Figure 7.5 Results of Kendall’s W for data on factors affecting decision to attend meeting


Ranks

               Mean Rank
AIRFARE          3.33
CLIMATE          6.00
SEASON           4.67
PEOPLE           2.00
PROGRAM          2.33
PUBLICTY         5.33
PRESENT          4.33
INTEREST         8.00


Test Statistics

                                                       Monte Carlo Sig.²
                                                              99% Confidence Interval
N    Kendall's W¹    Chi-Square    df    Asymp. Sig.    Sig.    Lower Bound    Upper Bound
3    .656            13.778        7     .055           .022    .018           .026


1. Kendall's Coefficient of Concordance
2. Based on 10000 sampled tables with starting seed 2000000.


The point estimate of the coefficient of concordance is 0.656. The asymptotic p value of
0.055 suggests that you cannot reject the null hypothesis that the coefficient is 0.
However, because of the small sample size (only 3 raters), this conclusion should be verified
with an exact test, or you can rely on a Monte Carlo estimate of the exact p value, based
on 10,000 random permutations of the original two-way layout of mid-ranks. The Monte
Carlo estimate is 0.022, less than half of the asymptotic p value, and is strongly
suggestive that the coefficient of concordance is not 0. The 99% confidence interval for the
exact p value is (0.018, 0.026). It confirms that you can reject the null hypothesis that
there is no association at the 5% significance level, since you are 99% assured that the exact
p value is no larger than 0.026.
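The values reported in Figure 7.5 can be checked with a short sketch of Equation 7.18 and Equation 7.19. The function names are hypothetical. The rank-sums are recovered from the reported mean ranks as mean rank times $N$, rounded to integers, which assumes that the raters assigned no tied ranks, so no tie correction is applied.

```python
def friedman_from_rank_sums(rank_sums, n):
    """Friedman statistic from treatment rank-sums, assuming no ties (no correction)."""
    k = len(rank_sums)
    center = n * (k + 1) / 2.0
    return 12.0 * sum((r - center) ** 2 for r in rank_sums) / (n * k * (k + 1))


def kendalls_w(rank_sums, n):
    """Kendall's W as the scaled Friedman statistic of Equation 7.19."""
    k = len(rank_sums)
    return friedman_from_rank_sums(rank_sums, n) / (n * (k - 1))


n_raters = 3
mean_ranks = [3.33, 6.00, 4.67, 2.00, 2.33, 5.33, 4.33, 8.00]    # from Figure 7.5
rank_sums = [round(m * n_raters) for m in mean_ranks]            # assumed integer rank-sums
print(rank_sums)
print(round(friedman_from_rank_sums(rank_sums, n_raters), 3))
print(round(kendalls_w(rank_sums, n_raters), 3))
```

The two printed statistics round to 13.778 and 0.656, matching the Chi-Square and Kendall's W entries in Figure 7.5.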

Equation 7.19 implies that Friedman’s test and Kendall’s W test will yield identical
p values. This can be verified by running Friedman’s test on the data shown in Figure
7.4. Figure 7.6 shows the asymptotic and Monte Carlo p values for Friedman’s test and
demonstrates that they are the same as those obtained with Kendall’s W test. The Monte
Carlo equivalence was achieved by using the same starting seed and the same number
of Monte Carlo samples for both tests. If a different starting seed had been used, the two
Monte Carlo estimates of the exact p value would have been slightly different.


Figure 7.6 Friedman’s test results for data on factors affecting decision to attend meeting 


Test Statistics¹

                                        Monte Carlo Sig.
                                               99% Confidence Interval
N    Chi-Square    df    Asymp. Sig.    Sig.    Lower Bound    Upper Bound
3    13.778        7     .055           .022    .018           .026


1. Friedman Test


Example: Relationship of Kendall’s W to Spearman’s R 


In Chapter 14, a different measure of association known as Spearman’s rank-order
correlation coefficient is discussed. That measure is applicable only if there are $N = 2$
judges, each ranking $K$ applicants. Could this measure be extended if $N$ exceeded 2? One
approach might be to form the $N!/(2!(N-2)!)$ distinct pairs of judges. Then each pair
would yield a value for Spearman’s rank-order correlation coefficient. Let
$\mathrm{ave}(R_S)$ denote the average of all these Spearman correlation coefficients. If
there are no ties in the data you can show (Conover, 1980) that

$$\mathrm{ave}(R_S) = \frac{NW - 1}{N - 1}$$    Equation 7.20

Thus, the average Spearman rank-order correlation coefficient is linearly related to
Kendall’s coefficient of concordance, and you have a natural way of extending the
concept of correlation from a measure of association between two judges to one between
several judges.

This can be illustrated with the data in Figure 7.4. As already observed, Kendall’s W
for these data is 0.656. Using the procedure discussed in “Spearman’s Rank-Order
Correlation Coefficient” on p. 178 in Chapter 14, you can compute Spearman’s
correlation coefficient for all possible pairs of raters. The Spearman correlation coefficient
between rater 4 and rater 21 is 0.7381. Between rater 4 and rater 11, it is 0.2857. Finally,
between rater 21 and rater 11, it is 0.4286. Therefore, the average of the three Spearman
correlation coefficients is (0.7381 + 0.2857 + 0.4286)/3 = 0.4841. Substituting
$N = 3$ and $W = 0.6561$ into Equation 7.20, you also get 0.4841.
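The identity in Equation 7.20 can also be checked numerically on any set of untied rankings. The sketch below uses hypothetical rankings by three judges of four applicants, since the rankings of Figure 7.4 are not reproduced here. The function names are illustrative, and the Spearman formula used is the standard one for rankings without ties.

```python
from itertools import combinations


def spearman_no_ties(x, y):
    """Spearman rank-order correlation for two untied rankings of the same K items."""
    k = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return 1.0 - 6.0 * d2 / (k * (k * k - 1))


def kendalls_w_no_ties(rankings):
    """Kendall's W for N untied rankings of K items (Equations 7.18 and 7.19)."""
    n, k = len(rankings), len(rankings[0])
    rank_sums = [sum(r[j] for r in rankings) for j in range(k)]
    center = n * (k + 1) / 2.0
    t_f = 12.0 * sum((s - center) ** 2 for s in rank_sums) / (n * k * (k + 1))
    return t_f / (n * (k - 1))


# Hypothetical rankings: N = 3 judges, K = 4 applicants.
judges = [(1, 2, 3, 4), (2, 1, 3, 4), (1, 3, 4, 2)]
pairs = list(combinations(judges, 2))                 # N!/(2!(N-2)!) = 3 pairs
ave_rs = sum(spearman_no_ties(x, y) for x, y in pairs) / len(pairs)
n = len(judges)
w = kendalls_w_no_ties(judges)
print(ave_rs, (n * w - 1) / (n - 1))
```

Both printed values agree up to floating-point rounding, as Equation 7.20 requires when there are no ties.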



Cochran’s Q Test 


Suppose that the $u_{ij}$ values in the two-way layout shown in Table 7.2 were all binary,
with a 1 denoting success and a 0 denoting failure. A popular mathematical model for
generating such binary data in the context of the two-way layout is the logistic
regression model

$$\log\left(\frac{\pi_{ij}}{1 - \pi_{ij}}\right) = \mu + \beta_i + \tau_j$$    Equation 7.21

where, for all $i = 1, 2, \ldots, N$ and $j = 1, 2, \ldots, K$, $\pi_{ij} = \Pr(U_{ij} = 1)$,
$\mu$ is the background log-odds of response, $\beta_i$ is the block effect, and $\tau_j$ is
the treatment effect. All of these parameters are unknown, but for identifiability you can
assume that

$$\sum_{i=1}^{N} \beta_i = \sum_{j=1}^{K} \tau_j = 0$$

Friedman’s test applied to such data is known as Cochran’s Q test. As before, the null
hypothesis that there is no treatment effect can be formally stated as

$$H_0\colon \tau_1 = \tau_2 = \cdots = \tau_K$$    Equation 7.22

Cochran’s Q test is used to test $H_0$ against unordered alternatives of the form

$$H_1\colon \tau_{j_1} \ne \tau_{j_2} \text{ for at least one } (j_1, j_2) \text{ pair}$$    Equation 7.23

Like Friedman’s test, Cochran’s Q is an omnibus test. The alternative hypothesis is
simply that the treatments are different, not that one specific treatment is more effective
than another. You can use the same test statistic as for Friedman’s test. Because of the
binary observations, the test statistic reduces to


$$Q = \frac{K(K-1) \sum_{j=1}^{K} (B_j - \bar{B})^2}{K \sum_{i=1}^{N} L_i - \sum_{i=1}^{N} L_i^2}$$    Equation 7.24

where $B_j$ is the total number of successes in the $j$th treatment, $L_i$ is the total number
of successes in the $i$th block, and $\bar{B}$ denotes the average
$(B_1 + B_2 + \cdots + B_K)/K$. The asymptotic distribution of $Q$ is chi-square with
$(K - 1)$ degrees of freedom. The exact and Monte Carlo results are calculated using the same
permutational arguments used for Friedman’s test. The exact, Monte Carlo, and asymptotic
two-sided p values are thus obtained by Equation 7.7, Equation 7.9, and Equation 7.14,
respectively.


Example: Crossover Clinical Trial of Analgesic Efficacy 


This data set is taken from a three-treatment, three-period crossover clinical trial
published by Snapinn and Small (1986). Twelve subjects each received, in random order,
three treatments for pain relief: a placebo, an aspirin, and an experimental drug. The
outcome of treatment $j$ on subject $i$ is denoted as either a success ($u_{ij} = 1$) or a
failure ($u_{ij} = 0$). Figure 7.7 shows the data displayed in the Data Editor.


Figure 7.7 Crossover clinical trial of analgesic efficacy 


id placebo aspirin drug 
1 Failure Success Success 
2 Failure Success Success 
3 Success Failure Success 
4 Failure Failure Failure 
5 Failure Failure Success 
6 Failure Success Success 
7 Success Failure Success 
8 Failure Failure Success 
9 Failure Failure Failure 
10 Failure Failure Success 
11 Failure Success Failure 
12 Failure Failure Success 

Cochran’s Q test can be used to determine if the response rates for the three
treatments differ. The results are displayed in Figure 7.8.


Figure 7.8 Cochran’s Q results for study of analgesic efficacy 


Frequencies

                 Value
                 0       1
Placebo          10      2
Aspirin          8       4
New Drug         3       9


Test Statistics¹

N                12
df               2
Asymp. Sig.      .020
Exact Sig.       .026


1. 0 is treated as a success.


The exact p value is 0.026 and indicates that the three treatments are indeed significantly
different at the 5% level. The asymptotic p value, 0.020, confirms this result. In this data
set, there was very little difference between the exact and the asymptotic inference.
However, the data set is fairly small, and a slightly different data configuration could have
resulted in an important difference between the exact and asymptotic p values. To
illustrate this point, ignore the data provided by the 12th subject. Running Cochran’s Q test
once more, this time on only the first 11 subjects, yields the results shown in Figure 7.9.


Figure 7.9 Cochran’s Q results for reduced analgesic efficacy data 


Frequencies

                 Value
                 0       1
Placebo          9       2
Aspirin          7       4
New Drug         3       8


Test Statistics¹

N     Cochran's Q    df    Asymp. Sig.    Exact Sig.    Point Probability
11    6.222          2     .045           .059          .024


1. 0 is treated as a success.

This time, the exact p value, 0.059, is not significant at the 5% level, but the
asymptotic approximation, 0.045, is. Although not strictly necessary for this small data set,
you can also run the Monte Carlo test on the first 11 subjects. The results are shown
in Figure 7.10.


Figure 7.10 Monte Carlo results for reduced analgesic efficacy data 


Test Statistics

                                          Monte Carlo Sig.
                                                 99% Confidence Interval
N     Cochran's Q    df    Asymp. Sig.    Sig.     Lower Bound    Upper Bound
11    6.222¹         2     .045           .056²    .050           .061


1. 0 is treated as a success.
2. Based on 10000 sampled tables with starting seed 2000000.


The Monte Carlo estimate of the exact p value was obtained by taking 10,000 random 
permutations of the observed two-way layout. As Figure 7.10 shows, the results 
matched those obtained from the exact test. The Monte Carlo sampling demonstrated 
that the exact p value lies in the interval (0.050, 0.061) with 99% confidence. This is 
compatible with the exact results, which also showed that the exact p value exceeds 
0.05. The asymptotic result, on the other hand, erroneously claimed that the p value is 
less than 0.05 and is therefore statistically significant at the 5% level. 
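The Cochran's Q results for the analgesic data can be reproduced from the data listed in Figure 7.7 with a short sketch of Equation 7.24 and the permutation argument of Equation 7.7. This is an illustrative brute-force enumeration, not the algorithm used by Exact Tests, and the function names are hypothetical. Successes are coded as 1 and failures as 0; the statistic is unchanged if the coding is reversed.

```python
import math
from fractions import Fraction
from itertools import permutations, product


def cochran_q(layout):
    """Cochran's Q (Equation 7.24) as an exact fraction."""
    k = len(layout[0])
    col_totals = [sum(row[j] for row in layout) for j in range(k)]    # B_j
    row_totals = [sum(row) for row in layout]                         # L_i
    # K(K-1) * sum (B_j - Bbar)^2 equals (K-1) * (K * sum B_j^2 - (sum B_j)^2).
    num = (k - 1) * (k * sum(b * b for b in col_totals) - sum(col_totals) ** 2)
    den = k * sum(row_totals) - sum(l * l for l in row_totals)
    return Fraction(num, den)


def exact_p_value(layout):
    """Pr(Q >= q) over all distinct within-block rearrangements, each equally likely."""
    q_obs = cochran_q(layout)
    per_block = [sorted(set(permutations(row))) for row in layout]
    total = hits = 0
    for candidate in product(*per_block):
        total += 1
        if cochran_q(candidate) >= q_obs:
            hits += 1
    return Fraction(hits, total)


# Figure 7.7 data; columns are placebo, aspirin, drug (1 = Success, 0 = Failure).
trial = [
    (0, 1, 1), (0, 1, 1), (1, 0, 1), (0, 0, 0), (0, 0, 1), (0, 1, 1),
    (1, 0, 1), (0, 0, 1), (0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 0, 1),
]
for subjects in (trial, trial[:11]):
    q = cochran_q(subjects)
    asymptotic = math.exp(-float(q) / 2.0)     # chi-square upper tail for df = 2
    exact = float(exact_p_value(subjects))
    print(len(subjects), round(float(q), 3), round(asymptotic, 3), round(exact, 3))
```

For the first 11 subjects this gives Q = 6.222 with asymptotic and exact p values of .045 and .059, as in Figure 7.9. For all 12 subjects the p values are .020 and .026, as quoted in the text. Blocks in which all three outcomes are equal contribute a single arrangement, so only 3^10 (or 3^9) layouts need to be visited.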
K-Sample Inference: Independent Samples


This chapter deals with tests based on K independent samples of data drawn from K 
distinct populations. The objective is to test the null hypothesis that the K populations 
all have the same response distributions against the alternative that the response 
distributions are different. The data could also arise from randomized clinical trials in 
which each subject is assigned, according to a prespecified randomization rule, to one 
of K treatments. Here it is not necessary to make any assumptions about the underlying 
populations from which these subjects were drawn, and the goal is simply to test that 
the K treatments are the same in terms of the responses they produce. Lehmann (1975) 
has demonstrated clearly that the same statistical methods are applicable whether the 
data arose from a population model or a randomization model. Thus, no distinction will 
be made between the two ways of gathering the data. 

This chapter generalizes the tests for two independent samples, discussed in Chapter 6,
to tests for K independent samples. There are two important distinctions between the
structure of the data in this chapter and in Chapter 7 (the chapter on K related samples). In
this chapter, the data are independent both within a sample and across samples; in Chapter
7, the data are correlated across the K samples. Also, in this chapter, the sample sizes can
differ across the K samples, with $n_j$ being the size of the $j$th sample; in Chapter 7, the
sample size, N, is required to be the same for each of the K samples.


Available Tests 


Table 8.1 shows the available tests for several independent samples, the procedure from 
which they can be obtained, and a bibliographical reference for each test. 
Table 8.1 Available tests


Tests                       Commands                                   References

Median test                 Nonparametric Tests: Tests for Several     Gibbons (1985)
                            Independent Samples

Kruskal-Wallis Test         Nonparametric Tests: Tests for Several     Siegel & Castellan (1988)
                            Independent Samples

Jonckheere-Terpstra Test    Nonparametric Tests: Tests for Several     Hollander & Wolfe (1973)
                            Independent Samples


The Kruskal-Wallis and the Jonckheere-Terpstra tests are also discussed in the chapters
on crosstabulated data. The Kruskal-Wallis test also appears in Chapter 11, which
discusses singly-ordered $r \times c$ contingency tables. The Jonckheere-Terpstra test also
appears in Chapter 12, which deals with doubly-ordered $r \times c$ contingency tables.
These tests are applicable both to data arising from nonparametric continuous 
univariate-response models (discussed in this chapter) and to data arising from 
categorical-response models such as the multinomial, Poisson, or hypergeometric 
models (discussed in later chapters). The tests in the two settings are completely 
equivalent, although the formulas for the test statistics might differ slightly to reflect the 
different mathematical models giving rise to the data. 


When to Use Each Test 


The tests discussed in this chapter are of two broad types: those appropriate for use 
against unordered alternatives and those for use against ordered alternatives. Following 
a discussion of these two types of tests, each individual test will be presented, along with 
the null and alternative hypotheses. 


Tests Against Unordered Alternatives 


Use the median test or the Kruskal-Wallis test if the alternatives to the null hypothesis 
of equality of the K populations are unordered. The term unordered alternatives means 
that there can be no a priori ordering of the K populations from which the samples were 
drawn, under the alternative hypothesis. As an example, the K populations might 
represent K distinct cities in the United States. Independent samples of individuals are 
taken from each city and some measurable characteristic, say annual income, is selected 
as the response. There is no a priori reason why the cities should be arranged in 
increasing order of the income distributions of their residents, under the alternative 
hypothesis. All you can reasonably say is that the income distributions are unequal. 
For tests against unordered alternatives, the only conclusion you can draw when the 
null hypothesis is rejected is that the K populations do not all have the same probability 
distribution. Therefore, a one-sided p value cannot be defined for testing a specific 
direction in which the K populations might be ordered under the alternative hypothesis.
Such tests are said to be inherently two-sided.


Median test. The median test is useful when you have no idea whatsoever about the
alternative hypothesis. It is an omnibus test for the equality of K distributions, where the
alternative hypothesis is simply that the distributions are unequal, without any further
specification as to whether they differ in shape, in location, or both. It uses only
information about the magnitude of each of the observations relative to a single number, the
median for the entire data set. Therefore, it is not as powerful as the other tests
considered here, most of which use more of the available information by considering the
relative magnitude of each observation when compared with every other observation. On
the other hand, it is the most general of the available tests, making no assumptions about
the alternative hypothesis.


Kruskal-Wallis test. This is one of the most popular nonparametric tests for comparing K
independent samples. It is the nonparametric analog of one-way ANOVA. In p value
calculations, mid-ranks are substituted for the raw data and exact permutational
distributions are substituted for F distributions derived from normality assumptions. It
has good power against location-shift alternatives, where the distributions from which
the samples were drawn have the same general shape but their means are shifted with
respect to each other. It is about 95% as efficient as one-way ANOVA for comparing K
samples when the underlying populations are normal and have a common variance (the
asymptotic relative efficiency is $3/\pi \approx 0.955$).


Tests Against Ordered Alternatives 


Use the Jonckheere-Terpstra test if the alternatives to the null hypothesis of equality of the 
K populations are ordered. The term ordered alternatives means that there is a natural a 
priori ordering of the K populations from which the samples were drawn, under the 
alternative hypothesis. For example, the K populations might represent K progressively 
increasing doses of some drug. Here the null hypothesis is that the different dose levels all 
produce the same response distributions; the alternative hypothesis is that there is a dose- 
response relationship in which increases in drug dose lead to increases in the magnitude of 
the response. In this setting, there is indeed an a priori natural ordering of the K populations 
in terms of increased dose levels of the drug. One of the implications of natural ordering 
under the alternative hypothesis is that the ordering could be either ascending or 
descending. For the dose-response example, you could define a one-sided p value for 
testing the null hypothesis against the alternative that an increase in drug dose increases 
the probability of response. But you could also define a one-sided p value against the 
alternative that it leads to a decrease in the probability of response. A two-sided p value 
could be defined to test the null hypothesis against either alternative. Thus, for tests against 
ordered alternatives, both one- and two-sided p values are relevant. 
Statistical Methods


The data for all the tests in this chapter consist of K independent samples each of size
$n_j$, $j = 1, 2, \ldots, K$, where $n_1 + n_2 + \cdots + n_K = N$. These $N$ observations can be
represented in the form of the one-way layout shown in Table 8.2.


Table 8.2. One-way layout for K independent samples 


Samples
1              2              ...    K
$u_{11}$       $u_{12}$       ...    $u_{1K}$
$u_{21}$       $u_{22}$       ...    $u_{2K}$
...            ...                   ...
$u_{n_1 1}$    $u_{n_2 2}$    ...    $u_{n_K K}$

This table, denoted by $u$, shows the observed one-way layout of raw data. The
observations in this one-way layout are independent both within and across columns. The
data arise from continuous univariate distributions (possibly with ties). Let

$$F_j(v) = \Pr(V \le v \mid j), \quad j = 1, 2, \ldots, K$$    Equation 8.1

denote the distribution from which the $n_j$ observations displayed in column $j$ of the
one-way layout were drawn. The goal is to test the null hypothesis

$$H_0\colon F_1 = F_2 = \cdots = F_K$$    Equation 8.2


In order to test $H_0$ by nonparametric methods, it is necessary to replace the original
observations in the above one-way layout with corresponding scores. These scores
represent various ways of ranking the data in the pooled sample of size $N$. Different tests
utilize different scores, as you will see in the individual sections on each test. Let
$w_{ij}$ be the score corresponding to $u_{ij}$. Then the one-way layout, with the original
data replaced by scores, is shown in Table 8.3.
Table 8.3 One-way layout with scores replacing original data


Samples
1              2              ...    K
$w_{11}$       $w_{12}$       ...    $w_{1K}$
$w_{21}$       $w_{22}$       ...    $w_{2K}$
...            ...                   ...
$w_{n_1 1}$    $w_{n_2 2}$    ...    $w_{n_K K}$

This table, denoted by $w$, shows the observed one-way layout of scores. Inference about
$H_0$ is based on comparing this observed one-way layout to others like it, in which the
individual $w_{ij}$ elements are the same but occupy different rows and columns. To
develop this idea more precisely, let the set $W$ denote the collection of all possible
K-column one-way layouts, with $n_j$ elements in column $j$, the members of which include
$w$ and all its permutations. The random variable $\tilde{w}$ is a permutation of $w$ if it
contains precisely the same scores as $w$ but with the scores rearranged so that, for at
least one $(i, j)$, $(i', j')$ pair, the scores $w_{ij}$ and $w_{i'j'}$ are interchanged.
Formally, let

$$W = \{\tilde{w} : \tilde{w} = w, \text{ or } \tilde{w} \text{ is a permutation of } w\}$$    Equation 8.3

In Equation 8.3, you could think of $\tilde{w}$ as a random variable, and $w$ as a specific
value assumed by it.

To clarify these concepts, consider a simple numerical example in which the original 
data come from three independent samples of size 5, 3, and 3, respectively. These data 
are displayed in a one-way layout, u, shown in Table 8.4. 


Table 8.4 Example of a one-way layout of original data 



As discussed in “Kruskal-Wallis Test” on p. 131, to run the Kruskal-Wallis test on these
data, you must replace them with their ranks. The one-way layout of observed scores,
with the original data replaced by their ranks, is shown in Table 8.5.


Table 8.5 One-way layout with ranks replacing original data 


Samples
1       2       3
3.5     6       9
1               10
3.5             11

This one-way layout of ranks is denoted by $w$. It is the one actually observed. Notice that
two observations were tied at 27 in $u$. Had they been separated by a small amount, they
would have ranked 3 and 4. But since they are tied, use the mid-rank, (3 + 4)/2 = 3.5,
as the rank for each of them in $w$. The symbol $W$ represents the set of all possible
one-way layouts in which the entries are the 11 numbers in $w$, with 5 numbers in column 1, 3
numbers in column 2, and 3 numbers in column 3. Thus, $w$ is one member of $W$. (It is
the one actually observed.) Another member is $w'$, where $w'$ is a different permutation
of the numbers in $w$, as shown in Table 8.6.


Table 8.6 Permutation of the observed one-way layout of scores 


Samples
1       2       3
6       5       9
1       8       10
3.5     7       11
3.5
2
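The mid-rank convention described above, in which tied observations share the average of the ranks they would otherwise occupy in the pooled sample, can be sketched as follows. The function name and the raw data are hypothetical. The values are not those of Table 8.4; they were chosen only so that three samples of sizes 5, 3, and 3 contain two observations tied at 27.

```python
def pooled_midranks(samples):
    """Replace each observation by its mid-rank in the pooled sample of size N."""
    pooled = sorted(v for sample in samples for v in sample)
    first_position = {}
    count = {}
    for position, v in enumerate(pooled, start=1):
        first_position.setdefault(v, position)
        count[v] = count.get(v, 0) + 1
    # A value occupying positions p .. p + c - 1 receives their average, p + (c - 1) / 2.
    rank = {v: first_position[v] + (count[v] - 1) / 2.0 for v in first_position}
    return [[rank[v] for v in sample] for sample in samples]


# Hypothetical one-way layout u with two observations tied at 27.
u = [[27, 10, 27, 31, 15], [34, 45, 40], [50, 55, 60]]
w = pooled_midranks(u)
print(w)
```

The two observations tied at 27 would have ranked 3 and 4, so each receives the mid-rank 3.5, while untied observations keep their ordinary ranks.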


All of the test statistics in this chapter are univariate functions of $\tilde{w} \in W$. Let
the test statistic be denoted by $T(\tilde{w}) \equiv T$, and its observed value be denoted by
$t(w) \equiv t$. The functional form of $T(\tilde{w})$ will be defined separately for each
test in subsequent sections of this chapter. Following is a discussion of the null
distribution of $T$: how it can be derived in general, and how it is used for p value
computations.
Distribution of T


In order to test the null hypothesis, $H_0$, you need to derive the distribution of $T$ under
the assumption that $H_0$ is true. This distribution is obtained by the following
permutational argument:


If $H_0$ is true, every member $\tilde{w} \in W$ has the same probability of being observed.


Lehmann (1975) has shown that the above permutational argument is valid whether the 
data were gathered independently from K populations or were obtained by assigning NV 
subjects to K treatments in accordance with a predetermined randomization rule. There- 
fore, no distinction will be made between these two ways of gathering the data. 

It follows from the above permutational argument that the exact probability of 
observing any $w \in W$ is 


$$h(w) = \frac{\prod_{j=1}^{K} n_j!}{N!}$$    Equation 8.4 


which does not depend on the specific way in which the original one-way layout, w, was 
permuted. Then 


$$\Pr(T = t) = \sum_{T(w) = t} h(w)$$    Equation 8.5 


the sum being taken over all $w \in W$. Similarly, the right tail of the distribution of T is 
obtained as 


$$\Pr(T \ge t) = \sum_{T(w) \ge t} h(w)$$    Equation 8.6 


The probability distribution of T and its tail areas are obtained in Exact Tests by 
numerical algorithms. In large samples, you can obtain an asymptotic approximation for 
Equation 8.6. Different approximations apply to the various tests described in this 
chapter and are discussed in the sections specific to each test. 
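
The permutational argument can be illustrated with a minimal Python sketch. It is not the algorithm used by Exact Tests (complete enumeration is practical only for very small layouts); the function names, the scores, the sample sizes, and the toy statistic are all hypothetical and serve only to show how Equation 8.4 and Equation 8.6 fit together.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial


def layouts(scores, sizes):
    """Yield every one-way layout: a tuple of K samples with the given sizes."""
    if not sizes:
        yield ()
        return
    positions = range(len(scores))
    for chosen in combinations(positions, sizes[0]):
        sample = tuple(scores[i] for i in chosen)
        rest = [scores[i] for i in positions if i not in chosen]
        for tail in layouts(rest, sizes[1:]):
            yield (sample,) + tail


def exact_right_tail(scores, sizes, statistic, t_obs):
    """Return (Pr(T >= t_obs), number of layouts) by complete enumeration."""
    h = Fraction(1, factorial(len(scores)))
    for n in sizes:
        h *= factorial(n)          # h(w) of Equation 8.4, the same for every w
    tail, count = Fraction(0), 0
    for w in layouts(scores, sizes):
        count += 1
        if statistic(w) >= t_obs:  # right tail of Equation 8.6
            tail += h
    return tail, count


# Hypothetical scores and sample sizes, chosen only to keep the enumeration tiny.
scores = [1, 2, 3, 4, 5]
sizes = [2, 2, 1]
first_sample_sum = lambda w: sum(w[0])  # toy statistic, not one of this chapter's tests
p, n_layouts = exact_right_tail(scores, sizes, first_sample_sum, t_obs=9)
print(n_layouts, p)  # 30 1/10
```

Tied scores are handled by enumerating positions rather than values, so two layouts that look identical are still counted separately, each with probability h(w).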


P Value Calculations 


The p value is the probability, under $H_0$, of obtaining a value of the test statistic at least as 
extreme as the one actually observed. The exact, Monte Carlo, and asymptotic p values 
can be computed for tests on K independent samples as follows. 


Exact P Values 


For all tests against unordered alternatives, the more extreme values of T are those that 
are larger than the observed t. The exact two-sided p value is then defined as 


$$p_2 = \Pr(T \ge t) = \sum_{T(w) \ge t} h(w)$$    Equation 8.7 


Since there is no a priori natural ordering of the K treatments under the alternative 
hypothesis, large observed values of T are indicative of a departure from $H_0$ but not of 
the direction of the departure. Therefore, it is not possible to define a one-sided p value 
for tests against unordered alternatives. 

For tests against ordered alternatives, such as the Jonckheere-Terpstra test, the test 
statistic T is considered extreme if it is either very large or very small. Large values of 
T indicate a departure from the null hypothesis in one direction, while small values of T 
indicate a departure from the null hypothesis in the opposite direction. Whenever the test 
statistic possesses a directional property of this type, it is possible to define both one- 
and two-sided p values. The exact one-sided p value is defined as 


$$p_1 = \min\{\Pr(T \ge t), \Pr(T \le t)\}$$    Equation 8.8 


and the exact two-sided p value is defined as 


$$p_2 = \Pr(|T - E(T)| \ge |t - E(T)|)$$    Equation 8.9 


where $E(T)$ is the expected value of T. 


Monte Carlo P Values 


When exact p values are too difficult to compute, you can estimate them by Monte Carlo 
sampling. Below, Monte Carlo sampling is used to estimate the exact p value given by 
Equation 8.7. The same procedure can be readily adapted to Equation 8.8 and Equation 8.9. 


1. Generate a new one-way layout of scores by permuting the original layout, w, in one 
of the $N!/(n_1!\,n_2! \cdots n_K!)$ equally likely ways. 


2. Compute the value of the test statistic T for the permuted one-way layout. 


3. Define the random variable 


$$Z = \begin{cases} 1 & \text{if } T \ge t \\ 0 & \text{otherwise} \end{cases}$$    Equation 8.10 


Repeat the above steps a total of M times to generate the realizations $(z_1, z_2, \ldots, z_M)$ for 
the random variable Z. Then an unbiased estimate of $p_2$ is 


$$\hat{p}_2 = \frac{\sum_{l=1}^{M} z_l}{M}$$    Equation 8.11 


Next, let 


$$\hat{\sigma} = \left[ \frac{1}{M-1} \sum_{l=1}^{M} (z_l - \hat{p}_2)^2 \right]^{1/2}$$    Equation 8.12 


be the sample standard deviation of the $z_l$'s. Then a 99% confidence interval for the 
exact p value is: 


$$CI = \hat{p}_2 \pm 2.576\,\hat{\sigma}/\sqrt{M}$$    Equation 8.13 


A technical difficulty arises when either $\hat{p}_2 = 0$ or $\hat{p}_2 = 1$. Now the sample standard 
deviation is 0, but the data do not support a confidence interval of zero width. An 
alternative way to compute a confidence interval that does not depend on $\hat{\sigma}$ is based on 
inverting an exact binomial hypothesis test when an extreme outcome is encountered. It 
can be shown that if $\hat{p}_2 = 0$, an $\alpha$% confidence interval for the exact p value is 


$$CI = [0,\; 1 - (1 - \alpha/100)^{1/M}]$$    Equation 8.14 


Similarly, when $\hat{p}_2 = 1$, an $\alpha$% confidence interval for the exact p value is 


$$CI = [(1 - \alpha/100)^{1/M},\; 1]$$    Equation 8.15 


Exact Tests uses default values of M = 10000 and $\alpha$ = 99%. While these defaults can 
be easily changed, we have found that they provide quick and accurate estimates of 
exact p values for a wide range of data sets. 
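
The Monte Carlo procedure of Equation 8.10 through Equation 8.15 can be sketched in Python as follows. This is an illustration under stated assumptions, not the Exact Tests implementation: the function name `monte_carlo_p`, its parameters, and the toy data are hypothetical, the random number generator is Python's own, and the multiplier 2.576 applies only to the 99% level.

```python
import math
import random


def monte_carlo_p(groups, statistic, M=10000, conf=99.0, z=2.576, seed=2000000):
    """Estimate Pr(T >= t) by permutation sampling; return (p_hat, (lower, upper)).

    z must be the normal quantile matching conf (2.576 corresponds to 99%).
    The seed is arbitrary and only makes the run reproducible.
    """
    rng = random.Random(seed)
    sizes = [len(g) for g in groups]
    pooled = [x for g in groups for x in g]
    t_obs = statistic(groups)
    hits = 0
    for _ in range(M):
        rng.shuffle(pooled)                    # step 1: a random permutation
        perm, start = [], 0
        for n in sizes:
            perm.append(pooled[start:start + n])
            start += n
        if statistic(perm) >= t_obs:           # steps 2 and 3: z_l = 1
            hits += 1
    p_hat = hits / M                           # Equation 8.11
    if hits == 0:                              # Equation 8.14
        return p_hat, (0.0, 1 - (1 - conf / 100) ** (1 / M))
    if hits == M:                              # Equation 8.15
        return p_hat, ((1 - conf / 100) ** (1 / M), 1.0)
    # The z_l are 0 or 1, so sum((z_l - p_hat)^2) equals M * p_hat * (1 - p_hat).
    sigma = math.sqrt(M * p_hat * (1 - p_hat) / (M - 1))   # Equation 8.12
    half = z * sigma / math.sqrt(M)                        # Equation 8.13
    return p_hat, (p_hat - half, p_hat + half)


# Hypothetical data and a toy statistic (sum of the last sample).
toy_groups = [[1, 2], [3, 4], [5, 6]]
p_hat, ci = monte_carlo_p(toy_groups, lambda g: sum(g[-1]), M=2000)
```

For a statistic computed in floating point, you may want to compare with a small tolerance so that permutations reproducing the observed value are not lost to rounding.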


Asymptotic P Values 


For tests against unordered alternatives, the asymptotic two-sided p value is obtained by 
noting that the large-sample distribution of T is chi-square with K - 1 degrees of 
freedom. The asymptotic p value is thus 


$$\tilde{p}_2 = \Pr(\chi^2_{K-1} \ge t)$$    Equation 8.16 


As noted earlier, one-sided p values are not defined for tests against unordered alternatives. 

For tests against ordered alternatives, in particular for the Jonckheere-Terpstra test, 
the asymptotic distribution of T is normal. The one- and two-sided p values are now 
defined by computing the normal approximations to Equation 8.8 and Equation 8.9, 
respectively. Thus, the asymptotic one-sided p value is defined as 


$$\tilde{p}_1 = \min\{\Phi((t - E(T))/\sigma_T),\; 1 - \Phi((t - E(T))/\sigma_T)\}$$    Equation 8.17 


and the asymptotic two-sided p value is defined as 


$$\tilde{p}_2 = 2\tilde{p}_1$$    Equation 8.18 


where $\Phi(z)$ is the tail area to the left of z from a standard normal distribution, and $\sigma_T$ 
is the standard deviation of T. Explicit expressions for $E(T)$ and $\sigma_T$ are provided in 
“Jonckheere-Terpstra Test” on p. 135. 


Median Test 


The median test is a nonparametric procedure for testing the null hypothesis $H_0$, given 
by Equation 8.2, against the general alternative 


$$H_1: \text{There exists at least one } (j_1, j_2) \text{ pair such that } F_{j_1} \ne F_{j_2}$$    Equation 8.19 


The median test is an omnibus test designed for a very general alternative hypothesis. It 
requires no assumptions about the K distributions, $F_j$, $j = 1, 2, \ldots, K$, being tested. 
However, if you have additional information about these distributions (for example, if you 
believe that they have the same shape but differ from one another by shift parameters 
under the alternative hypothesis), there are more powerful tests available. 

To define the test statistic for the median test, the first step is to transform the original 
one-way layout of data, as shown in Table 8.2, into a one-way layout of scores, as shown 
in Table 8.3. To compute these scores, first obtain the grand median, $\delta$, for the pooled 
sample of size N. The median is calculated in the following way. Let 
$\alpha_{[1]} \le \alpha_{[2]} \le \cdots \le \alpha_{[N]}$ be the pooled sample of $u_{ij}$ values, sorted in ascending order. Then 


$$\delta = \begin{cases} \alpha_{[(N+1)/2]} & \text{if } N \text{ is odd} \\ (\alpha_{[N/2]} + \alpha_{[(N+2)/2]})/2 & \text{if } N \text{ is even} \end{cases}$$    Equation 8.20 


The score, $w_{ij}$, corresponding to each $u_{ij}$ is defined as 


$$w_{ij} = \begin{cases} 1 & \text{if } u_{ij} \le \delta \\ 0 & \text{if } u_{ij} > \delta \end{cases}$$    Equation 8.21 


Define 


$$w_j = \sum_{i=1}^{n_j} w_{ij}$$    Equation 8.22 


as the total number of observations in the jth sample that are at or below the median and 


$$m = \sum_{j=1}^{K} w_j$$    Equation 8.23 


as the total number of observations in the pooled sample that are at or below the median. 

The test statistic for the median test is defined on the 2 x K contingency table 
displayed in Table 8.7. The entries in the first row are the counts of the number of 
subjects in each sample whose responses fall at or below the median, while the entries 
in the second row are the counts of the number of subjects whose responses fall above 
the median. 


Table 8.7 Data grouped into a 2 x K contingency table for the median test 


Group ID        Samples                                                   Row Total 
                1               2               ...    K 
≤ Median        $w_1$           $w_2$           ...    $w_K$              $m$ 
> Median        $n_1 - w_1$     $n_2 - w_2$     ...    $n_K - w_K$        $N - m$ 
Column Total    $n_1$           $n_2$           ...    $n_K$              $N$ 

The probability of observing this contingency table under the null hypothesis, 
conditional on fixing the margins, is given by the hypergeometric function 


$$h(w) = \frac{\prod_{j=1}^{K} \binom{n_j}{w_j}}{\binom{N}{m}}$$    Equation 8.24 




For any $w \in W$, the test statistic for the median test is the usual Pearson chi-square statistic 


$$T = \sum_{j=1}^{K} \frac{(w_j - n_j m/N)^2}{n_j m/N} + \sum_{j=1}^{K} \frac{(n_j - w_j - n_j(N-m)/N)^2}{n_j(N-m)/N}$$    Equation 8.25 


Thus, if t is the value of T actually observed, the exact two-sided p value for the median 
test is given by 


$$p_2 = \sum_{T(w) \ge t} h(w)$$    Equation 8.26 


the sum being taken over all $w \in W$ for which $T(w) \ge t$. An asymptotic approximation 
to $p_2$ is obtained by noting that T converges to the chi-square distribution with K - 1 
degrees of freedom. Therefore, 


$$\tilde{p}_2 = \Pr(\chi^2_{K-1} \ge t)$$    Equation 8.27 


The Monte Carlo two-sided p value is obtained as described in “P Value Calculations” 
on p. 123. Namely, you can generate a sequence of M 2 x K contingency tables, 
$w_1, w_2, \ldots, w_M$, each with the same margins as Table 8.7, such that table $w_l$ is generated 
with probability $h(w_l)$, given by Equation 8.24. For each table generated in this way, 
you can compute the test statistic, $t_l$, and define a quantity $z_l = 1$ if $t_l \ge t$; 0 otherwise. 
The Monte Carlo estimate of $p_2$ is 


$$\hat{p}_2 = \sum_{l=1}^{M} z_l / M$$    Equation 8.28 


The 99% Monte Carlo confidence interval for the true p value is calculated by 
Equation 8.13. 


Example: Hematologic Toxicity Data 


The data on hematologic toxicity are shown in Figure 8.1. The data consist of two 
variables: drug is the chemotherapy regimen for each patient and days represents the 
number of days the patient’s white blood count (WBC) was less than 500. The data 
consist of 28 cases. 


Figure 8.1 Data on hematologic toxicity 


The exact results of the median test for these data are shown in Figure 8.2, and the results 
of the Monte Carlo estimate of the exact test, using 10,000 Monte Carlo samples, are 
shown in Figure 8.3. 


Figure 8.2 Median test results for hematologic toxicity data 


Frequencies 

                                    Drug Regimen 
                                    Drug 1   Drug 2   Drug 3   Drug 4   Drug 5 
Days with WBC < 500   > Median      2        1        2        3        4 
                      <= Median     2        4        3        6        1 


Test Statistics¹ 

                        N    Median   Chi-Square   df   Asymp. Sig.   Exact Sig.   Point Probability 
Days with WBC < 500     28   7.00     4.317²       4    .365          .429 


1. Grouping Variable: Drug Regimen 
2. 9 cells (90.0%) have expected frequencies less than 5. The minimum expected cell frequency is 1.7. 

Figure 8.3 Monte Carlo median test results for hematologic toxicity data 


Test Statistics¹ 

                                                                      Monte Carlo Sig. 
                                                                              99% Confidence Interval 
                        N    Median   Chi-Square   df   Asymp. Sig.   Sig.    Lower Bound   Upper Bound 
Days with WBC < 500     28   7.00     4.317²       4    .365          .432³   .419          .444 


1. Grouping Variable: Drug Regimen 
2. 9 cells (90.0%) have expected frequencies less than 5. The minimum expected cell frequency is 1.7. 
3. Based on 10000 sampled tables with starting seed 2000000. 


The median for the pooled sample is 7.0. This results in the value 4.317 for the test 
statistic, based on Equation 8.25. The exact p value is 0.429 and does not provide any 
evidence that the five drugs produce different distributions for the WBC. The asymptotic 
p value, 0.365, supports this conclusion, but in this small data set, it is not a good 
approximation of the exact p value. On the other hand, the Monte Carlo estimate of the 
exact p value, 0.432, comes much closer to the exact p value. The 99% Monte Carlo 
confidence interval for the exact p value, (0.419, 0.444), also supports the conclusion that 
there is no significant difference in the distribution of WBC across the five drugs. 

The following discussion shows the relationship between the median test and the 
Pearson chi-square test. The median of these data is 7.0. The data can be divided into 
two groups, with one group containing those cases with WBC ≤ 7 and the other group 
containing those cases with WBC > 7. The crosstabulation of these two groups, divided 
by the median, with the five drug regimens, is shown in Figure 8.4. 


Figure 8.4 Hematologic toxicity data grouped into a 2 x K contingency table for the median 
test 


Count 
                      Drug Regimen 
                      Drug 1   Drug 2   Drug 3   Drug 4   Drug 5 
GROUP   WBC <= 7      2        4        3        6        1 
        WBC > 7       2        1        2        3        4 


The results of the Pearson chi-square test are shown in Figure 8.5. Notice that the results 
are the same as those obtained by running the median test on the original one-way layout 
of data. 
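
As a check on this equivalence, the following Python sketch recomputes the Pearson statistic of Equation 8.25 from the counts in Figure 8.4 and then sums the hypergeometric probabilities of Equation 8.24 over all 2 x K tables with the same margins. The function names are hypothetical, and the brute-force enumeration is an assumption made for illustration; it is not the algorithm used by Exact Tests.

```python
from itertools import product
from math import comb


def pearson_T(w, n, m):
    """Pearson chi-square statistic of Equation 8.25 for a 2 x K table."""
    N = sum(n)
    t = 0.0
    for wj, nj in zip(w, n):
        e_low = nj * m / N            # expected count at or below the median
        e_high = nj * (N - m) / N     # expected count above the median
        t += (wj - e_low) ** 2 / e_low + (nj - wj - e_high) ** 2 / e_high
    return t


def exact_median_test(w_obs, n):
    """Return (observed T, exact two-sided p value) by enumerating all tables."""
    m, N = sum(w_obs), sum(n)
    t_obs = pearson_T(w_obs, n, m)
    tail = 0
    for w in product(*(range(nj + 1) for nj in n)):
        if sum(w) != m:
            continue                  # keep only tables with the observed margins
        ways = 1
        for wj, nj in zip(w, n):
            ways *= comb(nj, wj)      # numerator of Equation 8.24
        if pearson_T(w, n, m) >= t_obs - 1e-9:   # tolerance guards against rounding
            tail += ways
    return t_obs, tail / comb(N, m)


# First row and column totals of Figure 8.4.
t_obs, p_exact = exact_median_test([2, 4, 3, 6, 1], [4, 5, 5, 9, 5])
print(round(t_obs, 3))  # 4.317
```

The value of `p_exact` can be compared with the Exact Sig. entry reported in Figure 8.5. The sketch assumes that no expected count is zero.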


Figure 8.5 Pearson’s chi-square results for hematologic toxicity data, divided by the 
median 


Chi-Square Tests 

                       Value     df   Asymp. Sig. (2-tailed)   Exact Sig. (2-tailed) 
Pearson Chi-Square     4.317¹    4    .365                     .429 
N of Valid Cases       28 


1. 9 cells (90.0%) have expected count less than 5. The minimum 
expected count is 1.71. 


Kruskal-Wallis Test 


The Kruskal-Wallis test (Siegel and Castellan, 1988) is a very popular nonparametric 
test for comparing K independent samples. When K = 2, it specializes to the Mann- 
Whitney test. The Kruskal-Wallis test has good power against shift alternatives. 
Specifically, you assume, as in Hollander and Wolfe (1973), that the one-way layout, u, 
shown in Table 8.2, was generated by the model 


$$u_{ij} = \mu + \tau_j + \varepsilon_{ij}$$    Equation 8.29 


for all $i = 1, 2, \ldots, n_j$ and $j = 1, 2, \ldots, K$. In this model, $\mu$ is the overall mean, $\tau_j$ is the 
treatment effect, and the $\varepsilon_{ij}$'s are identically distributed unobservable error terms from 
an unknown distribution with a mean of 0. All parameters are unknown, but for 
identifiability, you can assume that 


$$\sum_{j=1}^{K} \tau_j = 0$$    Equation 8.30 


The null hypothesis of no treatment effect can be formally stated as 


$$H_0: \tau_1 = \tau_2 = \cdots = \tau_K$$    Equation 8.31 


The Kruskal-Wallis test has good power against the alternative hypothesis 


$$H_1: \tau_{j_1} \ne \tau_{j_2} \text{ for at least one } (j_1, j_2) \text{ pair}$$    Equation 8.32 


Notice that this alternative hypothesis does not specify any ordering of the treatments in 
terms of increases in response levels. The alternative to the null hypothesis is simply that 
the treatments are different, not that one specific treatment elicits greater response than 
another. If there were a natural ordering of treatments under the alternative hypothesis 
(that is, if you could state a priori that the $\tau_j$'s are ordered under the alternative 
hypothesis), a more powerful test would be the Jonckheere-Terpstra test (Hollander and Wolfe, 
1973), discussed on p. 135. 

To define the Kruskal-Wallis test statistic, the first step is to convert the one-way layout, 
u, of raw data, as shown in Table 8.2, into a corresponding one-way layout of scores, w, as 
shown in Table 8.3. The scores, $w_{ij}$, for the Kruskal-Wallis test are the ranks of the 
observations in the pooled sample of size N. If there were no ties, the set of $w_{ij}$ values in Table 
8.3 would simply be some permutation of the first N integers. However, to allow for the 
possibility that some observations might be tied, you can assign the mid-rank of a set of tied 
observations to each of them. The easiest way to explain how the mid-ranks are computed 
is by considering a numerical example. Suppose that $u_{13}, u_{17}, u_{21}, u_{32}$ are all tied at the same 
numerical value, say 55. Assume that these four observations would occupy positions 15, 
16, 17, and 18, if all the N observations were pooled and then sorted in ascending order. In 
this case, you would assign the mid-rank (15 + 16 + 17 + 18)/4 = 16.5 to these four tied 
observations. Thus, $w_{13} = w_{17} = w_{21} = w_{32} = 16.5$. 

More generally, let $\alpha_{[1]} \le \alpha_{[2]} \le \cdots \le \alpha_{[N]}$ denote the pooled sample of all of the N 
observations sorted in ascending order. To allow for the possibility of ties, let there be 
g distinct values among the sorted $\alpha_{[i]}$'s, with $e_1$ observations being equal 
to the smallest value, $e_2$ observations being equal to the second smallest value, 
$e_3$ observations being equal to the third smallest value, and so on, until, finally, 
$e_g$ observations are equal to the largest value. It is now possible to define the 
mid-ranks precisely. For $l = 1, 2, \ldots, g$, the distinct mid-rank assumed by all of the $e_l$ 
observations tied in the lth smallest position is 


$$w^*_l = e_1 + e_2 + \cdots + e_{l-1} + (e_l + 1)/2$$ 


In this way, the original one-way layout of raw data is converted into a corresponding 
one-way layout of mid-ranks. 
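
The mid-rank rule can be sketched in a few lines of Python. The function name `mid_ranks` and the sample values are hypothetical; the running total `below` plays the role of the sum of the tie counts for all smaller values.

```python
from collections import Counter


def mid_ranks(pooled):
    """Return the mid-rank of every observation, in the original order."""
    counts = Counter(pooled)          # number of observations tied at each distinct value
    rank_of, below = {}, 0            # below = count of observations at smaller values
    for value in sorted(counts):
        e = counts[value]
        rank_of[value] = below + (e + 1) / 2
        below += e
    return [rank_of[u] for u in pooled]


# Hypothetical pooled sample with one tie at 27.
print(mid_ranks([27, 10, 27, 5, 40]))  # [3.5, 2.0, 3.5, 1.0, 5.0]
```

Applied to four observations tied in positions 15 through 18, the same rule gives 14 + (4 + 1)/2 = 16.5.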
Next, for any treatment j, where $j = 1, 2, \ldots, K$, define the rank-sum as 


$$w_j = \sum_{i=1}^{n_j} w_{ij}$$    Equation 8.33 


The Kruskal-Wallis test statistic, $T(w) \equiv T$, for any $w \in W$, can now be defined as 


$$T = \frac{12}{N(N+1)[1 - \lambda/(N^3 - N)]} \sum_{j=1}^{K} \frac{[w_j - n_j(N+1)/2]^2}{n_j}$$    Equation 8.34 


where $\lambda$ is a tie correction factor given by 


$$\lambda = \sum_{l=1}^{g} (e_l^3 - e_l)$$    Equation 8.35 


The Kruskal-Wallis test is also defined in Chapter 11, using the notation developed for 
analyzing r x c contingency tables. The two definitions are equivalent. Since the test is 
applicable to both continuous and categorical data, the test statistic is defined twice, 
once in the context of a one-way layout and once in the context of a contingency table. 

Let t denote the value of T actually observed from the data. The exact, Monte Carlo, 
and asymptotic p values based on the Kruskal-Wallis statistic can be obtained as 
discussed in “P Value Calculations” on p. 123. The exact two-sided p value is computed 
as shown in Equation 8.7. The Monte Carlo two-sided p value is computed as in 
Equation 8.11, and the asymptotic two-sided p value is computed as shown in Equation 
8.16. One-sided p values are not defined for tests against unordered alternatives like the 
Kruskal-Wallis test. 
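
Equation 8.33 through Equation 8.35 can be combined into a short Python sketch. The function name and the small samples are hypothetical; the sketch assumes that the observations are not all tied, since the tie correction would otherwise make the denominator zero.

```python
from collections import Counter


def kruskal_wallis_T(groups):
    """Kruskal-Wallis statistic with mid-ranks and tie correction (Equation 8.34)."""
    pooled = [u for g in groups for u in g]
    N = len(pooled)
    counts = Counter(pooled)
    rank_of, below = {}, 0
    for value in sorted(counts):                 # mid-ranks of the pooled sample
        rank_of[value] = below + (counts[value] + 1) / 2
        below += counts[value]
    lam = sum(e ** 3 - e for e in counts.values())           # Equation 8.35
    s = 0.0
    for g in groups:
        w_j = sum(rank_of[u] for u in g)                     # Equation 8.33
        s += (w_j - len(g) * (N + 1) / 2) ** 2 / len(g)
    return 12 * s / (N * (N + 1) * (1 - lam / (N ** 3 - N)))


# Hypothetical samples: one layout without ties, one with ties.
t_no_ties = kruskal_wallis_T([[1, 2], [3, 4], [5, 6]])   # equals 32/7
t_ties = kruskal_wallis_T([[1, 1], [2, 2]])              # equals 3.0
```

The resulting value can be passed as the observed statistic to an exact or Monte Carlo routine, with the same function supplied as the statistic to evaluate on each permuted layout.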


Example: Hematologic Toxicity Data, Revisited 


The Kruskal-Wallis test can be used to reconsider the hematologic toxicity data 
displayed in Figure 8.1. You can once again compare the five drugs to determine if they 
have significantly different response distributions. This time, however, the test statistic 
actually takes advantage of the relative rankings of the different observations instead of 
simply using the information that an observation is either above or below the pooled 
median. Thus, you can expect the Kruskal-Wallis test to be more powerful than the 
median test. Although it is too difficult to obtain the exact p value for this data set, you 
can obtain an extremely accurate Monte Carlo estimate of the exact p value based on a 
Monte Carlo sample of size 10,000. The results are shown in Figure 8.6. 


Figure 8.6 Monte Carlo results of Kruskal-Wallis test for hematologic toxicity data 


Ranks 

                                                 N     Mean Rank 
Days with WBC < 500   Drug Regimen   Drug 1      4     11.88 
                                     Drug 2      5     7.50 
                                     Drug 3      5     17.70 
                                     Drug 4      9     13.50 
                                     Drug 5      5     22.20 
                                     Total       28 


Test Statistics¹,² 

                                                        Monte Carlo Sig. 
                                                                99% Confidence Interval 
                        Chi-Square   df   Asymp. Sig.   Sig.    Lower Bound   Upper Bound 
Days with WBC < 500     9.415        4    .052          .038³   .033          .043 


1. Kruskal-Wallis Test 
2. Grouping Variable: Drug Regimen 
3. Based on 10000 sampled tables with starting seed 2000000. 


As expected, the greater power of the Kruskal-Wallis test leads to a smaller p value than 
obtained with the median test. There is, however, a difference between the asymptotic 
inference and the exact inference computed by the Monte Carlo estimate. The Monte 
Carlo estimate of the exact p value is 0.038, and the exact p value lies in the range 
(0.033, 0.043) with 99% confidence. Thus, the null 
hypothesis can be rejected at the 5% significance level. The asymptotic inference, in 
contrast, was unable to estimate the true p value with this degree of accuracy. It 
generated a p value of 0.052, which is not significant at the 5% level. 


Jonckheere-Terpstra Test 


The Jonckheere-Terpstra test (Hollander and Wolfe, 1973) is more powerful than the 
Kruskal-Wallis test for comparing K samples against ordered alternatives. Once again, 
assume that the one-way layout shown in Table 8.2 was generated by the model Equation 
8.29. The null hypothesis of no treatment effect is again given by Equation 8.31. This 
time, however, suppose that the alternative hypothesis is ordered. Specifically, the one- 
sided alternative might be of the form 


$$H_1: \tau_1 \le \tau_2 \le \cdots \le \tau_K$$    Equation 8.36 


implying that as you increase the index j, identifying the treatment, the distribution of 
responses shifts to the right. Or else, the one-sided alternative might be of the form 


$$H_1': \tau_1 \ge \tau_2 \ge \cdots \ge \tau_K$$    Equation 8.37 


implying that as you increase the index j, identifying the treatment, the distribution shifts 
to the left. The two-sided alternative would state that either $H_1$ or $H_1'$ is true, without 
specifying which. 

To define the Jonckheere-Terpstra statistic, the first step, as usual, is to replace the 
original observations with scores. Here, however, let the score, $w_{ij}$, be exactly the same 
as the actual observation, $u_{ij}$. Then w = u and W, as defined by Equation 8.3, is the set 
of all possible permutations of the one-way layout of actually observed raw data. Now, 
for any $w \in W$, you compute K(K - 1)/2 Mann-Whitney counts (see, for example, 
Lehmann, 1976), $\{\lambda_{ab}\}$, $1 \le a \le (K-1)$, $(a+1) \le b \le K$, as follows. For any (a, b), 
$\lambda_{ab}$ is the count of the number of pairs, $(w_{\alpha a}, w_{\beta b})$, which are such that $w_{\alpha a} < w_{\beta b}$, plus 
half the number of pairs which are such that $w_{\alpha a} = w_{\beta b}$. The Jonckheere-Terpstra test 
statistic, $T(w) \equiv T$, is defined as follows: 


$$T = \sum_{a=1}^{K-1} \sum_{b=a+1}^{K} \lambda_{ab}$$    Equation 8.38 


The mean of the Jonckheere-Terpstra statistic is 


$$E(T) = \frac{N^2 - \sum_{j=1}^{K} n_j^2}{4}$$    Equation 8.39 


The formula for the variance is more complicated. Suppose, as in “Kruskal-Wallis Test” 
on p. 131, that there are g distinct $w_{ij}$'s among all N observations pooled together, with 
$e_1$ observations being equal to the smallest value, $e_2$ observations 
being equal to the second smallest value, $e_3$ observations being equal to the 
third smallest value, and so on, until, finally, $e_g$ observations are equal to the 
largest value. The variance of the Jonckheere-Terpstra statistic is 


$$\begin{aligned}
\sigma_T^2 ={}& \frac{1}{72}\left[ N(N-1)(2N+5) - \sum_{j=1}^{K} n_j(n_j-1)(2n_j+5) - \sum_{l=1}^{g} e_l(e_l-1)(2e_l+5) \right] \\
&+ \frac{1}{36N(N-1)(N-2)} \left[ \sum_{j=1}^{K} n_j(n_j-1)(n_j-2) \right] \left[ \sum_{l=1}^{g} e_l(e_l-1)(e_l-2) \right] \\
&+ \frac{1}{8N(N-1)} \left[ \sum_{j=1}^{K} n_j(n_j-1) \right] \left[ \sum_{l=1}^{g} e_l(e_l-1) \right]
\end{aligned}$$ 


Now, let $t(w) \equiv t$ be the observed value of T. The exact, Monte Carlo, and asymptotic p 
values based on the Jonckheere-Terpstra statistic can be obtained as discussed in “P 
Value Calculations” on p. 123. The exact one- and two-sided p values are computed as in 
Equation 8.8 and Equation 8.9, respectively. The Monte Carlo two-sided p value is 
computed as in Equation 8.11, with an obvious modification to reflect the fact that you 
want to estimate the probability inside the region $\{|T - E(T)| \ge |t - E(T)|\}$ instead of the 
region $\{T \ge t\}$. The Monte Carlo one-sided p value can be similarly defined. The 
asymptotic distribution of T is normal, with mean $E(T)$ and variance $\sigma_T^2$. The 
asymptotic one- and two-sided p values are obtained by Equation 8.17 and Equation 
8.18, respectively. 
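
The Jonckheere-Terpstra statistic, its mean, and its variance can be sketched in Python as follows. The function names and the small samples are hypothetical; the variance follows the three-term expression for tied data and assumes N is at least 3.

```python
from collections import Counter


def jt_statistic(groups):
    """Sum of Mann-Whitney counts over all pairs of samples a < b (Equation 8.38)."""
    T = 0.0
    for a in range(len(groups) - 1):
        for b in range(a + 1, len(groups)):
            for x in groups[a]:
                for y in groups[b]:
                    if x < y:
                        T += 1.0
                    elif x == y:
                        T += 0.5      # tied pairs count one half
    return T


def jt_mean(groups):
    """E(T) of Equation 8.39."""
    n = [len(g) for g in groups]
    return (sum(n) ** 2 - sum(x * x for x in n)) / 4


def jt_variance(groups):
    """Variance of T with the correction terms for ties."""
    n = [len(g) for g in groups]
    e = list(Counter(u for g in groups for u in g).values())   # tie counts e_l
    N = sum(n)
    first = (N * (N - 1) * (2 * N + 5)
             - sum(x * (x - 1) * (2 * x + 5) for x in n)
             - sum(x * (x - 1) * (2 * x + 5) for x in e)) / 72
    second = (sum(x * (x - 1) * (x - 2) for x in n)
              * sum(x * (x - 1) * (x - 2) for x in e)) / (36 * N * (N - 1) * (N - 2))
    third = (sum(x * (x - 1) for x in n)
             * sum(x * (x - 1) for x in e)) / (8 * N * (N - 1))
    return first + second + third


# Hypothetical samples in increasing order, with no ties.
toy = [[1, 2], [3, 4], [5, 6]]
print(jt_statistic(toy), jt_mean(toy))  # 12.0 6.0
```

For the toy layout every pair is concordant, so T reaches its maximum of 12, well above its null mean of 6.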


Example: Space-Shuttle O-Ring Incidents Data 


Professor Richard Feynman, in his delightful book What Do You Care What Other 
People Think? (1988), recounted at great length his experiences as a member of the 
presidential commission formed to determine the cause of the explosion of the space 
shuttle Challenger in 1986. He suspected that the low temperature at takeoff caused the 
O-rings to fail. In his book, he has published the data on temperature versus the number 
of O-ring incidents, for 24 previous space shuttle flights. These data are shown in Figure 
8.7. There are two variables in the data: incident indicates the number of O-ring 
incidents, and is either none, one, two, or three; temp indicates the temperature in 
Fahrenheit. 


Figure 8.7 Space-shuttle O-ring incidents and temperature at launch 


The null hypothesis is that the temperatures in the four samples (0, 1, 2, or 3 O-ring 
incidents) have come from the same underlying population distribution. The one-sided 
alternative hypothesis is that populations with a higher number of O-ring incidents have 
their temperature distributions shifted to the right of populations with a lower number 
of O-ring incidents. The Jonckheere-Terpstra test is superior to the Kruskal-Wallis test 
for this data set because the populations have a natural ordering under the alternative 
hypothesis. The results of the Jonckheere-Terpstra test for these data are shown in 
Figure 8.8. 


Figure 8.8 Jonckheere-Terpstra test results for O-ring incidents data 


Jonckheere-Terpstra Test¹ 

Temperature (Fahrenheit) 
    Number of Levels in O-Ring Incidents    4 
    Observed J-T Statistic                  29.500 
    Mean J-T Statistic                      65.000 
    Std. Deviation of J-T Statistic         15.902 
    Std. J-T Statistic                      -2.232 
    Asymp. Sig. (2-tailed)                  .026 
    Exact Sig. (2-tailed)                   .024 
    Exact Sig. (1-tailed)                   .012 
    Point Probability                       .001 


1. Grouping Variable: O-Ring Incidents 


The Jonckheere-Terpstra test statistic is displayed in its standardized form 


$$T^* = \frac{T - E(T)}{\sigma_T}$$    Equation 8.40 


whose observed value is 


$$t^* = \frac{t - E(T)}{\sigma_T}$$    Equation 8.41 


The output shows that t = 29.5, $E(T)$ = 65, and $\sigma_T$ = 15.9. Therefore, 
$t^*$ = -2.232. The exact one-sided p value is 


$$p_1 = \min\{\Pr(T^* \ge t^*), \Pr(T^* \le t^*)\}$$    Equation 8.42 


The exact two-sided p value is 


$$p_2 = \Pr(|T^*| \ge |t^*|)$$    Equation 8.43 


These definitions are completely equivalent to those given by Equation 8.8 and Equation 
8.9, respectively. Asymptotic and Monte Carlo one- and two-sided p values can be 
similarly defined in terms of the standardized test statistic. Note that $T^*$ is asymptotically 
normal with zero mean and unit variance. 

The exact one-sided p value of 0.012 reveals that there is indeed a statistically 
significant correlation between temperature and number of O-ring incidents. The sign of the 
standardized test statistic, $t^*$ = -2.232, is negative, thus implying that higher launch 
temperatures are associated with fewer O-ring incidents. The two-sided p value would 
be used if you had no a priori reason to believe that the number of O-ring incidents is 
negatively correlated with takeoff temperature. Here the exact two-sided p value, 0.024, 
is also statistically significant. 
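
The standardized statistic and the asymptotic p values can be reproduced from the figures in the output with a short Python sketch. The helper `phi` is a hypothetical name for the standard normal distribution function, written here with the error function from the standard library.

```python
from math import erf, sqrt


def phi(z):
    """Area to the left of z under the standard normal curve."""
    return 0.5 * (1 + erf(z / sqrt(2)))


t, mean_T, sd_T = 29.5, 65.0, 15.902      # values reported in Figure 8.8
t_star = (t - mean_T) / sd_T              # standardized statistic
p1_asymp = min(phi(t_star), 1 - phi(t_star))   # normal approximation, one-sided
p2_asymp = 2 * p1_asymp                        # normal approximation, two-sided
print(round(t_star, 3))  # -2.232
```

The two-sided value comes out at roughly 0.026, in line with the Asymp. Sig. (2-tailed) entry in Figure 8.8 and slightly larger than the exact two-sided p value of 0.024.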


Introduction to Tests on R x C 
Contingency Tables 


This chapter discusses hypothesis tests on data that are cross-classified into 
contingency tables with r rows and c columns. The cross-classification is based on 
categorical variables that may be either nominal or ordered. Nominal categorical 
variables take on distinct values that cannot be positioned in any natural order. An 
example of a nominal variable is color (for example, red, green, or blue). In some 
statistical packages, nominal variables are also referred to as class variables, or 
unordered variables. Ordered categorical variables take on distinct values that can be 
ordered in a natural way. An example of an ordered categorical variable is drug dose 
(for example, low, medium, or high). Ordered categorical variables can assume 
numerical values as well (for example, the drug dose might be categorized into 100 
mg/m², 200 mg/m², and 300 mg/m²). When the number of distinct numerical values 
assumed by the ordered variable is very large (for example, the weights of individuals 
in a population), it is more convenient to regard the variable as continuous (possibly 
with ties) rather than categorical. There is considerable overlap between the statistical 
methods used to analyze continuous data and those used to analyze ordered 
categorical data. Indeed, many of the same statistical tests are applicable to both 
situations. However, the probabilistic behavior of an ordered categorical variable is 
captured by a different mathematical model than that of a continuous variable. For this 
reason, continuous variables are discussed separately in Part 1. 

This chapter summarizes the statistical theory underlying the exact, Monte Carlo, 
and asymptotic p value computations for all the tests in Chapter 10, Chapter 11, and 
Chapter 12. Chapter 10 discusses tests for r x c contingency tables in which the row 
and column classifications are both nominal. These are referred to as unordered con- 
tingency tables. Chapter 11 discusses tests for rx c contingency tables in which the 
column classifications are based on ordered categorical variables. These are referred to 
as singly ordered contingency tables. Chapter 12 discusses tests for r x c tables in 
which both the row and column classifications are based on ordered categorical vari- 
ables. These are referred to as doubly ordered contingency tables. 

Table 9.1 shows an observed r x c contingency table in which $x_{ij}$ is the count of 
the number of observations falling into row category i and column category j. 


Table 9.1 Observed r x c contingency table 


Rows         Col_1       Col_2       ...    Col_c       Row_Total 
Row_1        $x_{11}$    $x_{12}$    ...    $x_{1c}$    $m_1$ 
Row_2        $x_{21}$    $x_{22}$    ...    $x_{2c}$    $m_2$ 
...          ...         ...         ...    ...         ... 
Row_r        $x_{r1}$    $x_{r2}$    ...    $x_{rc}$    $m_r$ 
Col_Total    $n_1$       $n_2$       ...    $n_c$       $N$ 

The main objective is to test whether the observed r x c contingency table is consistent 
with the null hypothesis of independence of row and column classifications. Exact Tests 
computes both exact and asymptotic p values for many different tests of this hypothesis 
against various alternative hypotheses. These tests are grouped in a logical manner and 
are presented in the next three chapters, which discuss unordered, singly ordered, and 
doubly ordered contingency tables, respectively. Despite these differences, there is a 
unified underlying framework for performing the hypothesis tests in all three situations. 
This unifying framework is discussed below in terms of p value computations. 

The p value of the observed r x c contingency table is used to test the null hypothesis 
of no row-by-column interaction. Exact Tests provides three categories of p values for 
each test. The “gold standard” is the exact p value. When it can be computed, the exact 
p value is recommended. Sometimes, however, a data set is too large for the exact p 
value computations to be feasible. In this case, the Monte Carlo technique, which is 
easier to compute, is recommended. The Monte Carlo p value is an extremely close 
approximation to the exact p value and is accompanied by a fairly narrow confidence 
interval within which the exact p value is guaranteed to lie (at the specified confidence 
level). Moreover, by increasing the number of Monte Carlo samples, you can make the 
width of this confidence interval arbitrarily small. Finally, the asymptotic p value is always 
available. For large, well-balanced data sets, the asymptotic p value is not too 
different from its exact counterpart, but, obviously, you can’t know this for the specific 
data set on hand without also having the exact or Monte Carlo p value available for 
comparison. In this section, all three p values will be defined. First, you will see how the 
exact p value is computed. Then, the Monte Carlo and asymptotic p values will be 
discussed as convenient approximations to the exact p value computation. 

To compute the exact p value of the observed r x c contingency table, it is necessary to: 


1. Define a reference set of r x c tables in which each table has a known probability 
under the null hypothesis of no row-by-column interaction. 


2. Order all the tables in the reference set according to a discrepancy measure (or test 
statistic) that quantifies the extent to which each table deviates from the null hypothesis. 


3. Sum the probabilities of all tables in the reference set that are at least as discrepant as 
the observed table. 
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Defining the Reference Set 


Throughout this chapter, x will be used to denote the 7 x c contingency table actually 
observed, and y will denote any generic r x c contingency table belonging to some well- 
defined reference set of rx c contingency tables that could have been observed. The 
exact probability of observing any generic table y depends on the sampling scheme used 
to generate it. When both the row and column classifications are categorical, Agresti 
(1990) lists three sampling schemes that could give rise to y—full multinomial sampling, 
product multinomial sampling, and Poisson sampling. Under all three schemes, the 
probability of observing y depends on unknown parameters relating to the individual cells 
of the r x c table. The key to exact nonparametric inference is eliminating all nuisance 
parameters from the distribution of y. This is accomplished by restricting the sample 
space to the set of all r x c contingency tables that have the same marginal sums as the 
observed table x. Specifically, define the reference set: 


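\[
\Gamma = \Bigl\{\, y : y \text{ is } r \times c;\ \sum_{j=1}^{c} y_{ij} = m_i;\ \sum_{i=1}^{r} y_{ij} = n_j;\ \text{for all } i, j \,\Bigr\}
\qquad \text{(Equation 9.1)}
\]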


Then, you can show that, under the null hypothesis of no row-by-column interaction, the 
probability of observing any y ∈ Γ is 


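\[
P(y) = \frac{\prod_{j=1}^{c} n_j! \, \prod_{i=1}^{r} m_i!}{N! \, \prod_{j=1}^{c} \prod_{i=1}^{r} y_{ij}!}
\qquad \text{(Equation 9.2)}
\]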


Equation 9.2, which is free of all unknown parameters, holds for categorical data wheth- 
er the sampling scheme used to generate y is full multinomial, product multinomial, or 
Poisson (Agresti, 1990). 
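As an illustration of the hypergeometric probability referred to as Equation 9.2, the following Python sketch evaluates the null probability of a table as the product of the factorials of its row and column sums, divided by N! times the product of the factorials of its cells. The function name `table_probability` and the small 2 x 3 table are assumptions made for this illustration; they are not part of Exact Tests.

```python
from math import factorial, prod

def table_probability(y):
    """Null probability of table y, conditional on its row and column sums."""
    row_sums = [sum(row) for row in y]          # m_i
    col_sums = [sum(col) for col in zip(*y)]    # n_j
    n_total = sum(row_sums)                     # N
    numerator = (prod(factorial(m) for m in row_sums)
                 * prod(factorial(n) for n in col_sums))
    denominator = factorial(n_total) * prod(factorial(v) for row in y for v in row)
    return numerator / denominator

# Hypothetical 2 x 3 table with row sums (3, 2) and column sums (2, 2, 1).
example = [[2, 1, 0],
           [0, 1, 1]]
print(table_probability(example))
```

Because only the cell counts and the margins enter the calculation, no unknown cell parameters are needed, which is the point made in the text.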

The reference set Γ need not be the actual sample space of the data-generating 
process. In product multinomial sampling, the row sums are fixed by the experimental 
design, but the column sums can vary from sample to sample. In full multinomial and 
Poisson sampling, both the row and column sums can vary. Conditioning on row and 
column sums is simply a convenient way to eliminate nuisance parameters from the 
expression for P(y), compute exact p values, and thus guarantee that you will be 
protected from a conditional type I error at any desired significance level. Moreover, 
since the unconditional type I error is a weighted sum of conditional type I errors, where 
the weights are the probabilities of the different marginal configurations, the protection 
from type I errors guaranteed by the conditional test carries over to the unconditional 
setting. The idea of conditional inference to eliminate nuisance parameters was first 
proposed by Fisher (1925). 
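The carry-over from the conditional to the unconditional setting can be written out in one line. In this sketch, the symbols α (the nominal significance level), s (a marginal configuration, that is, one set of row and column sums), and "reject" (the event that the conditional test rejects) are notation assumed here for illustration only.

```latex
\Pr(\text{reject} \mid H_0)
  = \sum_{s} \Pr(s \mid H_0)\, \Pr(\text{reject} \mid s, H_0)
  \le \sum_{s} \Pr(s \mid H_0)\, \alpha
  = \alpha
```

Each conditional rejection probability is at most α, and the weights sum to one, so the weighted sum cannot exceed α.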

Defining the Test Statistic 


For statistical inference, each table y ∈ Γ is ordered by a test statistic or discrepancy 
measure that quantifies the extent to which the table deviates from the null hypothesis 
of no row-by-column interaction. The test statistic will be denoted by D(y). Large absolute 
values of D furnish evidence against the null hypothesis, while small absolute values 
are consistent with it. The functional form of D(y) for each test is given in the chapter 
specific to each test. Throughout this chapter, the function D(y) will be used to denote a 
generic test statistic. Specific instances of test statistics will be denoted by their own 
unique symbols. For example, for the Pearson chi-square test, the generic symbol D(y) 
is replaced by CH(y), and the test statistic has the functional form of 


\[
CH(y) = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(y_{ij} - m_i n_j / N)^2}{m_i n_j / N}
\qquad \text{(Equation 9.3)}
\]


Exact Two-Sided P Values 


The exact two-sided p value is defined as the sum of null probabilities of all the tables 
in Γ that are at least as extreme as the observed table x with respect to D. Specifically, 

\[
p_2 = \sum_{D(y) \ge D(x)} P(y) = \Pr\{D(y) \ge D(x)\}
\qquad \text{(Equation 9.4)}
\]


For later reference, define the critical region of the reference set: 
\[
\Gamma^{*} = \{\, y \in \Gamma : D(y) \ge D(x) \,\}
\qquad \text{(Equation 9.5)}
\]


Computing Equation 9.4 is sometimes rather difficult because the size of the reference 
set Γ grows exponentially. For example, the reference set of all 5 x 6 tables with row 
sums of (7, 7, 12, 4, 4) and column sums of (4, 5, 6, 5, 7, 7) contains 1.6 billion tables. 
However, the tables in this reference set are all rather sparse and unlikely to yield 
accurate p values based on large sample theory. Exact Tests uses network algorithms based 
on the methods of Mehta and Patel (1983, 1986a, 1986b) to enumerate the tables in Γ 
implicitly and thus quickly identify those in Γ*. This makes it feasible to compute exact 
p values for many seemingly intractable data sets such as the one above. 
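The three steps (define the reference set, order the tables by a test statistic, sum the probabilities of the tables at least as discrepant as the observed one) can be sketched for a very small table by explicit enumeration. This is a brute-force illustration, not the network algorithm used by Exact Tests; all function names and the 2 x 3 example table are assumptions made for the sketch, and it assumes every row and column sum is positive.

```python
from math import factorial, prod

def margins(y):
    return [sum(r) for r in y], [sum(c) for c in zip(*y)]

def table_probability(y):
    """Hypergeometric null probability of y given its margins."""
    rows, cols = margins(y)
    num = prod(map(factorial, rows)) * prod(map(factorial, cols))
    den = factorial(sum(rows)) * prod(factorial(v) for r in y for v in r)
    return num / den

def pearson(y):
    """Pearson chi-square statistic, used here as the discrepancy measure D."""
    rows, cols = margins(y)
    n = sum(rows)
    return sum((v - m * c / n) ** 2 / (m * c / n)
               for r, m in zip(y, rows) for v, c in zip(r, cols))

def compositions(total, caps):
    """All ways to split `total` over len(caps) cells with cell k <= caps[k]."""
    if len(caps) == 1:
        if total <= caps[0]:
            yield (total,)
        return
    for v in range(min(total, caps[0]) + 1):
        for rest in compositions(total - v, caps[1:]):
            yield (v,) + rest

def reference_set(rows, cols):
    """Yield every table with the given row and column sums."""
    if len(rows) == 1:
        yield (tuple(cols),)          # the last row is forced by what remains
        return
    for first in compositions(rows[0], cols):
        left = [c - v for c, v in zip(cols, first)]
        for tail in reference_set(rows[1:], left):
            yield (first,) + tail

def exact_p_value(x, statistic=pearson, tol=1e-9):
    """Sum P(y) over tables whose statistic is at least the observed one."""
    rows, cols = margins(x)
    d_obs = statistic(x)
    # tol guards against floating-point ties between equally extreme tables
    return sum(table_probability(y) for y in reference_set(rows, cols)
               if statistic(y) >= d_obs - tol)

observed = [[2, 0, 1],
            [0, 2, 0]]               # hypothetical data
print(exact_p_value(observed))
```

Explicit enumeration of this kind is only practical for tiny tables; the count of tables grows far too quickly for examples such as the 5 x 6 case mentioned above.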

Notwithstanding the availability of the network algorithms, a data set is sometimes 
too large for the exact p value to be feasible to compute. But it might be too sparse for 
the asymptotic p value to be reliable. For this situation, Exact Tests also provides a 
Monte Carlo option, where only a small proportion of the r x c tables in Γ are sampled, 
and an unbiased estimate of the exact p value is obtained. 


Monte Carlo Two-Sided P Values 


The Monte Carlo two-sided p value is a very close approximation to the exact two-sided 
p value, but it is much easier to compute. The examples in Chapter 10, Chapter 11, and 
Chapter 12 will show that, for all practical purposes, the Monte Carlo results can be used 
in place of the exact results whenever the latter are too difficult to compute. The Monte 
Carlo approach is a steady, reliable procedure that, unlike the exact approach, always takes 
up a predictable amount of computing time. While it does not produce the exact p value, 
it does produce a fairly tight confidence interval within which the exact p value is 
contained, with a high degree of confidence (usually 99%). 

In the Monte Carlo method, a total of M tables is sampled from Γ, each table being 
sampled in proportion to its hypergeometric probability (see Equation 9.2). (Sampling 
tables in proportion to their probabilities is known as crude Monte Carlo sampling.) 

For each table $y_j \in \Gamma$ that is sampled, define the binary outcome $z_j = 1$ if 
$y_j \in \Gamma^{*}$; 0 otherwise. The arithmetic average of all M of these $z_j$'s is taken 
as the Monte Carlo point estimate of the exact two-sided p value: 


\[
\hat{p}_2 = \frac{1}{M} \sum_{j=1}^{M} z_j
\qquad \text{(Equation 9.6)}
\]


It is easy to show that $\hat{p}_2$ is an unbiased estimate of the exact two-sided p value. Next, 

\[
\hat{\sigma} = \Bigl[ \frac{1}{M-1} \sum_{j=1}^{M} (z_j - \hat{p}_2)^2 \Bigr]^{1/2}
\qquad \text{(Equation 9.7)}
\]


is the sample standard deviation of the $z_j$'s. Then a 99% confidence interval for the 
exact p value is 

\[
CI = \hat{p}_2 \pm 2.576\, \hat{\sigma} / \sqrt{M}
\qquad \text{(Equation 9.8)}
\]
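The crude Monte Carlo procedure can be sketched in a few lines of Python. The sketch below is an illustration under stated assumptions, not the sampler used by Exact Tests: it draws tables with the observed margins by randomly re-pairing the row and column labels of the individual observations (which samples tables in proportion to their hypergeometric probabilities), uses the Pearson statistic as D, and applies the point estimate, sample standard deviation, and 99% interval described above. The function names and the example table are hypothetical.

```python
import random
from math import sqrt

def pearson(y):
    rows = [sum(r) for r in y]
    cols = [sum(c) for c in zip(*y)]
    n = sum(rows)
    return sum((v - m * c / n) ** 2 / (m * c / n)
               for r, m in zip(y, rows) for v, c in zip(r, cols))

def monte_carlo_p_value(x, m=10000, seed=0, tol=1e-9):
    """Return (point estimate, lower bound, upper bound) of the 99% interval."""
    rng = random.Random(seed)
    r, c = len(x), len(x[0])
    # One row label and one column label per observation in the observed table.
    row_labels = [i for i, row in enumerate(x) for v in row for _ in range(v)]
    col_labels = [j for row in x for j, v in enumerate(row) for _ in range(v)]
    d_obs = pearson(x)
    hits = 0
    for _ in range(m):
        rng.shuffle(col_labels)               # random re-pairing keeps both margins
        y = [[0] * c for _ in range(r)]
        for i, j in zip(row_labels, col_labels):
            y[i][j] += 1
        hits += pearson(y) >= d_obs - tol     # z_j = 1 if the table is in the critical region
    p_hat = hits / m
    # Sample standard deviation of the binary z_j values.
    sum_sq = hits * (1 - p_hat) ** 2 + (m - hits) * p_hat ** 2
    sigma = sqrt(sum_sq / (m - 1))
    half_width = 2.576 * sigma / sqrt(m)
    return p_hat, p_hat - half_width, p_hat + half_width

observed = [[2, 0, 1],
            [0, 2, 0]]                        # hypothetical data
print(monte_carlo_p_value(observed, m=4000, seed=1))
```

The running time depends on M and the sample size rather than on the size of the reference set, which is why the computing time is predictable.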

A technical difficulty arises when either $\hat{p}_2 = 0$ or $\hat{p}_2 = 1$. The sample 
standard deviation is now zero, but the data do not support a confidence interval of zero 
width. An alternative way to compute a confidence interval that does not depend on 
$\hat{\sigma}$ is based on inverting an exact binomial hypothesis test when an extreme 
outcome is encountered. It can be easily shown that if $\hat{p}_2 = 0$, an α% confidence 
interval for the exact p value is 

\[
CI = \bigl[\, 0,\ 1 - (1 - \alpha/100)^{1/M} \,\bigr]
\qquad \text{(Equation 9.9)}
\]

Similarly, when $\hat{p}_2 = 1$, an α% confidence interval for the exact p value is 

\[
CI = \bigl[\, (1 - \alpha/100)^{1/M},\ 1 \,\bigr]
\qquad \text{(Equation 9.10)}
\]
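The two boundary intervals are simple to evaluate. The following sketch assumes the forms [0, 1 - (1 - α/100)^(1/M)] and [(1 - α/100)^(1/M), 1], with α the confidence level in percent; the function names are hypothetical.

```python
def ci_when_estimate_is_zero(m, conf_percent=99.0):
    """Interval for the exact p value when none of the m sampled tables is extreme."""
    return 0.0, 1.0 - (1.0 - conf_percent / 100.0) ** (1.0 / m)

def ci_when_estimate_is_one(m, conf_percent=99.0):
    """Interval for the exact p value when all m sampled tables are extreme."""
    return (1.0 - conf_percent / 100.0) ** (1.0 / m), 1.0

print(ci_when_estimate_is_zero(10000))
print(ci_when_estimate_is_one(10000))
```

With 10,000 sampled tables and 99% confidence, the upper bound in the first case is a little under 0.0005, so an estimate of zero still leaves room for a small positive exact p value.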


Asymptotic Two-Sided P Values 


For all the tests in this chapter, the test statistic D(y) has an asymptotic chi-square 
distribution. The asymptotic two-sided p value is obtained as 

\[
\tilde{p}_2 = \Pr(\chi^2 \ge D(x) \mid df)
\qquad \text{(Equation 9.11)}
\]

where $\chi^2$ is a random variable with a chi-square distribution and df are the appropriate 
degrees of freedom. For tests on unordered r x c contingency tables, the degrees of freedom 
are (r - 1) x (c - 1); for tests on singly ordered r x c contingency tables, the degrees of 
freedom are (r - 1); and tests on doubly ordered contingency tables have one 
degree of freedom. Since the square root of a chi-square variate with one degree of freedom 
has a standard normal distribution, you can also work with normally distributed test 
statistics for the doubly ordered r x c contingency tables. 


Unordered R x C Contingency 
Tables 


The tests in this chapter are applicable to r x c contingency tables whose rows and 
columns cannot be ordered in a natural way. In the absence of such an ordering, it is not 
possible to specify any particular direction for the alternative to the null hypothesis that the 
row and column classifications are independent. The tests considered here are appropriate 
in this setting because they have good power against the omnibus alternative, or universal 
hypothesis, that the row and column classifications are not independent. Subsequent chap- 
ters deal with tests that have good power against more specific alternatives. 


Available Tests 


Exact Tests offers three tests for analyzing unordered r x c contingency tables. They 
are the Pearson chi-square test, the likelihood-ratio test, and Fisher’s exact test. 
Asymptotically, all three tests follow the chi-square distribution with (r - 1)(c - 1) 
degrees of freedom. Both exact and asymptotic p values are available from Exact Tests. 
The asymptotic p value is provided by default, while the exact p value must be 
specifically requested. If a data set is too large for the exact p value to be computed, Exact 
Tests offers a special option whereby the exact p value is estimated up to Monte Carlo 
accuracy. Table 10.1 shows the three available tests, the procedure from which they can 
be obtained, and a bibliographical reference for each test. 


Table 10.1 Available tests 

Test                      Procedure   Reference 
Pearson chi-square test   Crosstabs   Agresti (1990) 
Likelihood-ratio test     Crosstabs   Agresti (1990) 
Fisher’s exact test       Crosstabs   Freeman and Halton (1951) 


When to Use Each Test 


Any of the three tests, Pearson, likelihood-ratio, or Fisher’s, may be used when both 
the row and column classifications of the r x c contingency table are unordered. All 
three tests are asymptotically equivalent. The research in this area is scant and has 
focused primarily on the question of which of the three asymptotic tests best matches its 
exact counterpart. (See, for example, Roscoe and Byars, 1971; Chapman, 1976; Agresti 
and Yang, 1987; Read and Cressie, 1988.) It is very likely that the Pearson chi-square 
asymptotic test converges to its exact counterpart the fastest. You can use the Exact Tests 
option to investigate this question and also to determine empirically which of the three 
exact tests has the most power against specific alternative hypotheses. 


Statistical Methods 


For the r x c contingency table shown in Table 9.1, $\pi_{ij}$ denotes the probability that an 
observation will be classified as belonging to row i and column j. Define the marginal 
probabilities: 

\[
\pi_{i+} = \sum_{j=1}^{c} \pi_{ij} \quad \text{for } i = 1, 2, \ldots, r
\]

\[
\pi_{+j} = \sum_{i=1}^{r} \pi_{ij} \quad \text{for } j = 1, 2, \ldots, c
\]

The Pearson chi-square test, the likelihood-ratio test, and Fisher’s exact test are all 
appropriate for testing the null hypothesis 

\[
H_0 : \pi_{ij} = \pi_{i+} \pi_{+j} \quad \text{for all } (i, j) \text{ pairs}
\qquad \text{(Equation 10.1)}
\]

against the general (omnibus) alternative that Equation 10.1 does not hold. An alternative 
hypothesis of this form is of interest when there is no natural ordering of the rows and 
columns of the contingency table. Thus, these three tests are usually applied to unordered 
r x c contingency tables. Note that all three tests are inherently two-sided in the 
following sense. A large positive value of the test statistic is evidence that there is at least one 
(i, j) pair for which Equation 10.1 fails to hold, without specifying which pair. 

If the sampling process generating the data is product multinomial, one set of 
marginal probabilities (the $\pi_{i+}$'s, say) will equal unity. Then $H_0$ reduces to the statement 
that the c multinomial probabilities are the same for all rows. In other words, the null 
hypothesis is equivalent to 

\[
H_0 : \pi_{1j} = \pi_{2j} = \cdots = \pi_{rj} = \pi_{+j} \quad \text{for all } j = 1, 2, \ldots, c
\qquad \text{(Equation 10.2)}
\]

In practice, product multinomial sampling arises when r populations are compared and 
the observations from each population fall into c distinct categories. The null hypothesis 
is that the multinomial probability of falling in the jth category, j = 1, 2, ..., c, is the 
same for each population. The Pearson, likelihood-ratio, and Fisher’s tests are most 
suitable when the c categories have no natural ordering (for example, geographic 
regions of the country). However, more powerful tests, such as the Kruskal-Wallis test, 
are available if the c categories have a natural ordering (for example, levels of toxicity). 
Such tests are discussed in Chapter 11 and Chapter 12. 


Oral Lesions Data 


The exact, Monte Carlo, and asymptotic versions of the Pearson chi-square test, the 
likelihood-ratio test, and Fisher’s exact test can be illustrated with the following sparse 
data set. Suppose that data were obtained on the location of oral lesions, in house-to- 
house surveys in three geographic regions of rural India. These data are displayed here 
in the form of a 9 x 3 contingency table, as shown in Figure 10.1. The variables shown 
in the table are site, which indicates the specific site of the oral lesion, and region, which 
indicates the geographic region. Count represents the number of patients with oral 
lesions at a specific site and living in a specific geographic region. 


Figure 10.1 Crosstabulation of oral lesions data set 

[Site of Lesion * Geographic Region Crosstabulation (cell counts). Rows, Site of Lesion: 
Labial Mucosa, Buccal Mucosa, Commissure, Gingiva, Hard Palate, Soft Palate, Tongue, 
Floor of Mouth, Alveolar Ridge. Columns, Geographic Region: Kerala, Gujarat, Andhra.] 

The question of interest is whether the distribution of the site of the oral lesion is 
significantly different in the three geographic regions. The row and column classifications for 
this 9 x 3 table are clearly unordered, making it an appropriate data set for the 
Pearson, likelihood-ratio, or Fisher’s tests. The contingency table is so sparse that the 
usual chi-square asymptotic distribution with 16 degrees of freedom is not likely to yield 
accurate p values. 


Pearson Chi-Square Test 


The Pearson chi-square test is perhaps the most commonly used procedure for testing 
null hypotheses of the form shown in Equation 10.1 or Equation 10.2 for independence 
of row and column classifications in an unordered r x c contingency table. For 
any observed r x c table, the test statistic, D(x), is denoted as CH(x) and is computed 
by the formula 

\[
CH(x) = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(x_{ij} - m_i n_j / N)^2}{m_i n_j / N}
\qquad \text{(Equation 10.3)}
\]


For the 9 x 3 contingency table of oral lesions data displayed in Figure 10.1, 
CH(x) = 22.1. The test statistic and its corresponding asymptotic and exact p values 
are shown in Figure 10.2. 


Figure 10.2 Exact and asymptotic Pearson chi-square test for oral lesions data 

                      Value     df   Asymp. Sig.   Exact Sig. 
                                     (2-tailed)    (2-tailed) 
Pearson Chi-Square    22.099¹   16   .140          .027 

1. 25 cells (92.6%) have expected count less than 5. The 
minimum expected count is .26. 


The results show that the observed value of the test statistic is CH(x) = 22.1. This 
statistic has an asymptotic chi-square distribution with 16 degrees of freedom, and the 
asymptotic p value is computed as the area under the chi-square density function to the 
right of CH(x) = 22.1. Taken at face value, the p value of 0.14 gives no significant 
evidence of row-by-column interaction. However, this p value cannot be trusted because of 
the sparseness of the observed contingency table. 
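The area to the right of the observed statistic can be checked with a short calculation. The sketch below uses the closed-form upper tail of the chi-square distribution that holds when the degrees of freedom are even, which covers the 16 degrees of freedom used here; the function name is an assumption made for this illustration, and it is not the routine used by Exact Tests.

```python
from math import exp

def chi2_upper_tail_even_df(x, df):
    """Pr(chi-square with df degrees of freedom >= x), valid only for even df > 0."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k          # (x/2)^k / k!
        total += term
    return exp(-half) * total

# Statistic and degrees of freedom reported in Figure 10.2.
print(round(chi2_upper_tail_even_df(22.099, 16), 3))
```

For odd degrees of freedom a general-purpose routine would be needed instead.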

The exact p value is shown in the portion of the output entitled Exact Sig. (2-tailed). It 
is defined by Equation 9.4 as the permutational probability Pr(CH(y) ≥ 22.1 | y ∈ Γ). The 
exact p value is 0.027, showing that there is a significant interaction between the site of the 
lesion and the geographic region, but the asymptotic p value failed to demonstrate this. In 
this example, the asymptotic p value was more conservative than the exact p value. 

Sometimes the data set is too large for an exact analysis, and the Monte Carlo method 
must be used instead. Figure 10.3 shows an unbiased estimate of the exact p value for 
the Pearson chi-square test based on a crude Monte Carlo sample of 10,000 tables from 
the reference set. 


Figure 10.3 Monte Carlo results for oral lesions data 

Chi-Square Tests 

                                                  Monte Carlo Significance (2-tailed) 
                                                           99% Confidence Interval 
                      Value     df   Asymp. Sig.   Sig.    Lower Bound   Upper Bound 
                                     (2-tailed) 
Pearson Chi-Square    22.099¹   16   .140          .026²   .022          .030 

1. 25 cells (92.6%) have expected count less than 5. The minimum expected count is .26. 
2. Based on 10000 and seed 2000000 ... 


The Monte Carlo method produces a 99% confidence interval for the exact p value. 
Thus, although the point estimate might change slightly if you resample with a different 
starting seed or a different random number generator, you can be 99% confident that the 
exact p value is contained in the interval 0.022 to 0.030. Moreover, you could always 
sample more tables from the reference set if you wanted to further narrow the width of 
this interval. Based on this analysis, it is evident that the Monte Carlo approach leads to 
the same conclusion as the exact approach, demonstrating that there is indeed a significant 
row-by-column interaction in this contingency table. The asymptotic inference 
failed to demonstrate any row-by-column interaction. 
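As a rough check on the interval in Figure 10.3, the 99% confidence interval of Equation 9.8 can be evaluated by hand. This sketch assumes the rounded point estimate 0.026 and M = 10000, and uses the fact that for binary outcomes the sample standard deviation is very nearly the square root of the estimate times one minus the estimate; the last digit can differ from the reported bounds because the inputs are rounded.

```latex
\hat{\sigma} \approx \sqrt{0.026 \times 0.974} \approx 0.1591,
\qquad
2.576 \times \frac{0.1591}{\sqrt{10000}} \approx 0.0041,
\qquad
CI \approx 0.026 \pm 0.0041 = (0.0219,\ 0.0301)
```

Because the half-width shrinks in proportion to $1/\sqrt{M}$, halving it would take roughly four times as many sampled tables.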


Likelihood-Ratio Test 


The likelihood-ratio test is an alternative to the Pearson chi-square test for testing 
independence of row and column classifications in an unordered r x c contingency table. 
For any observed r x c contingency table, the test statistic, D(x), is denoted as LI(x) 
and is computed by the formula 

\[
LI(x) = 2 \sum_{i=1}^{r} \sum_{j=1}^{c} x_{ij} \log\Bigl( \frac{x_{ij}}{m_i n_j / N} \Bigr)
\qquad \text{(Equation 10.4)}
\]
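A minimal Python sketch of the likelihood-ratio statistic follows. It computes twice the sum over cells of the count times the log of the ratio of the count to its expected value under independence, and it assumes the usual convention that cells with a zero count contribute nothing to the sum. The function name and the example table are hypothetical.

```python
from math import log

def likelihood_ratio(x):
    """Likelihood-ratio statistic for an r x c table given as a list of rows."""
    row_sums = [sum(row) for row in x]          # m_i
    col_sums = [sum(col) for col in zip(*x)]    # n_j
    n_total = sum(row_sums)                     # N
    total = 0.0
    for row, m in zip(x, row_sums):
        for count, n in zip(row, col_sums):
            if count > 0:                       # assumed convention: 0 * log(0) = 0
                total += count * log(count / (m * n / n_total))
    return 2.0 * total

observed = [[2, 0, 1],
            [0, 2, 0]]                          # hypothetical data
print(likelihood_ratio(observed))
```

A table whose counts match the expected counts exactly gives a statistic of zero, and larger values indicate larger departures from independence.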




For the oral lesions data displayed in Figure 10.1, LI(x) = 23.3. The test statistic and 
its corresponding asymptotic and exact p values are shown in Figure 10.4. 


Figure 10.4 Results of likelihood-ratio test for oral lesions data 

Chi-Square Tests 

                    Value    df   Asymp. Sig.   Exact Sig. 
                                  (2-tailed)    (2-tailed) 
Likelihood Ratio    23.297   16   .106          .036 


The output shows that the observed value of the test statistic is LI(x) = 23.3. This 
statistic has an asymptotic chi-square distribution with 16 degrees of freedom. The 
asymptotic p value is computed as the area under the chi-square density function to the right 
of LI(x) = 23.3. Taken at face value, the p value of 0.106 gives no significant evidence of 
row-by-column interaction. However, this p value cannot be trusted because of the 
sparseness of the observed contingency table. 

The exact p value is defined by Equation 9.4 as the permutational probability 
Pr(LI(y) ≥ 23.3 | y ∈ Γ). The exact p value is 0.036, showing that there is a significant 
interaction between the site of lesion and the geographic region, but the asymptotic p value 
failed to demonstrate this. In this example, the asymptotic p value was more conservative 
than the exact p value. 

Sometimes the data set is too large for an exact analysis, and the Monte Carlo method 
must be used instead. Figure 10.5 shows an unbiased estimate of the exact p value for 
the likelihood-ratio test based on a crude Monte Carlo sample of 10,000 tables from the 
reference set. 


Figure 10.5 Estimate of exact p value for likelihood-ratio test based on Monte Carlo 
sampling 

Chi-Square Tests 

                                                Monte Carlo Significance (2-tailed) 
                                                         99% Confidence Interval 
                    Value    df   Asymp. Sig.   Sig.     Lower Bound   Upper Bound 
                                  (2-tailed) 
Likelihood Ratio    23.297   16   .106          .035²    .030          .039 

2. Based on 10000 and seed 2000000 ... 

The Monte Carlo point estimate is 0.035, which is acceptably close to the exact p value 
of 0.036. More important, the Monte Carlo method also produces a confidence interval 
for the exact p value. Thus, although this point estimate might change slightly if you 
resample with a different starting seed or a different random number generator, you can 
be 99% confident that the exact p value is contained in the interval 0.030 to 0.039. 
Moreover, you could always sample more tables from the reference set if you wanted to 
further narrow the width of this interval. Based on this analysis, it is evident that the Monte 
Carlo approach leads to the same conclusion as the exact approach, demonstrating that 
there is indeed a significant row-by-column interaction in this contingency table. The 
asymptotic inference failed to demonstrate any row-by-column interaction. 


Fisher’s Exact Test 


Fisher’s exact test is traditionally associated with the single 2 x 2 contingency table. Its 
extension to unordered r x c tables was first proposed by Freeman and Halton (1951). 
Thus, it is also known as the Freeman-Halton test. It is an alternative to the Pearson 
chi-square and likelihood-ratio tests for testing independence of row and column 
classifications in an unordered r x c contingency table. Fisher’s exact test is available 
for tables larger than 2 x 2 through the Exact Tests option. Asymptotic results are 
provided only for 2 x 2 tables, while exact and Monte Carlo results are available for 
larger tables. For any observed r x c contingency table, the test statistic, D(x), is 
denoted as FI(x) and is computed by the formula 

\[
FI(x) = -2 \log(\gamma P(x))
\qquad \text{(Equation 10.5)}
\]


where 


\[
\gamma = (2\pi)^{(r-1)(c-1)/2} \, N^{-(rc-1)/2} \prod_{i=1}^{r} m_i^{(c-1)/2} \prod_{j=1}^{c} n_j^{(r-1)/2}
\qquad \text{(Equation 10.6)}
\]

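The statistic can be sketched in Python by working on the log scale. The sketch assumes that P(x) is the hypergeometric probability of the table given its margins and that the scaling constant is (2π)^((r-1)(c-1)/2) N^(-(rc-1)/2) times the product of the row sums raised to (c-1)/2 and the column sums raised to (r-1)/2. The function name and the example tables are hypothetical.

```python
from math import lgamma, log, pi

def fisher_statistic(x):
    """Fisher (Freeman-Halton) statistic: -2 * log(gamma * P(x))."""
    r, c = len(x), len(x[0])
    row_sums = [sum(row) for row in x]          # m_i
    col_sums = [sum(col) for col in zip(*x)]    # n_j
    n_total = sum(row_sums)                     # N

    def log_factorial(k):
        return lgamma(k + 1)

    # log of the hypergeometric probability of x given its margins
    log_p = (sum(map(log_factorial, row_sums))
             + sum(map(log_factorial, col_sums))
             - log_factorial(n_total)
             - sum(log_factorial(v) for row in x for v in row))
    # log of the scaling constant gamma
    log_gamma = ((r - 1) * (c - 1) / 2 * log(2 * pi)
                 - (r * c - 1) / 2 * log(n_total)
                 + (c - 1) / 2 * sum(log(m) for m in row_sums)
                 + (r - 1) / 2 * sum(log(n) for n in col_sums))
    return -2.0 * (log_gamma + log_p)

less_likely = [[2, 0, 1], [0, 2, 0]]    # hypothetical table, P = 0.1
more_likely = [[1, 1, 1], [1, 1, 0]]    # same margins, P = 0.4
print(fisher_statistic(less_likely), fisher_statistic(more_likely))
```

Since the constant depends only on the margins, ordering tables in the reference set by this statistic is the same as ordering them from most probable to least probable.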
For the oral lesions data displayed in Figure 10.1, FI(x) = 19.72. The exact p value 
is shown in Figure 10.6. 

Figure 10.6 Fisher’s exact test for oral lesions data 

Chi-Square Tests 

                       Value    Exact Sig. 
                                (2-tailed) 
Fisher's Exact Test    19.721   .010 

The exact p value is defined by Equation 9.4 as the permutational probability 
Pr(FI(y) ≥ 19.72 | y ∈ Γ). The exact p value is 0.010, showing that there is a significant 
interaction between the site of the lesion and the geographic region. The asymptotic result 
was off the mark and failed to demonstrate a significant outcome. In this example, the 
asymptotic p value was more conservative than the exact p value. 

Sometimes the data set is too large for an exact analysis, and the Monte Carlo 
method must be used instead. Figure 10.7 shows an unbiased estimate of the exact p 
value for Fisher’s exact test based on a crude Monte Carlo sample of 10,000 tables 
from the reference set. 


Figure 10.7 Monte Carlo estimate of Fisher’s exact test for oral lesions data 

Chi-Square Tests 

                       Monte Carlo Significance (2-tailed) 
                       99% Confidence Interval 
                       Lower Bound   Upper Bound 
Fisher's Exact Test    .007          .013 

1. Based on 10000 and seed 2000000 ... 


The Monte Carlo method produces a 99% confidence interval for the exact p value. 
Thus, although this point estimate might change slightly if you resample with a different 
starting seed or a different random number generator, you can be 99% confident that the 
exact p value is contained in the interval 0.007 to 0.013. Moreover, you could always 
sample more tables from the reference set if you wanted to further narrow the width of 
this interval. Based on this analysis, it is evident that the Monte Carlo approach leads to 
the same conclusion as the exact approach, demonstrating that there is indeed a significant 
row-by-column interaction in this contingency table. The asymptotic inference 
failed to demonstrate any row-by-column interaction. 


Singly Ordered R x C 
Contingency Tables 


The test in this chapter is applicable to r x c contingency tables in which the rows are 
unordered but the columns are ordered. This is a common setting, for example, when 
comparing r different drug treatments, each generating an ordered categorical response. 
It is assumed a priori that the treatments cannot be ordered according to their rate of 
effectiveness. If they can be ordered according to their rate of effectiveness (for example, 
if the treatments represent increasing doses of some drug), the tests in the next 
chapter are more applicable. 


Available Test 


Exact Tests offers the Kruskal-Wallis test for analyzing r x c contingency tables in 
which the rows (r) are unordered but the columns (c) have a natural ordering. Although 
the logic of the Kruskal-Wallis test can be applied to singly ordered contingency tables, 
this test is performed through the Nonparametric Tests: Tests for Several Independent 
Samples procedure. (See Siegel and Castellan, 1988.) 


When to Use the Kruskal-Wallis Test 


Use the Kruskal-Wallis test for an r x c contingency table in which the rows (r) are 
unordered but the columns (c) are ordered. Note that it is very important to keep the 
columns ordered, not the rows. In this chapter, the Kruskal-Wallis test is applied to ordinal 
categorical data. See Chapter 8 for a discussion of using this test for continuous data. 


Statistical Methods 


The data consist of c categorical responses generated by subjects in r populations, 
and cross-classified into an r x c contingency table, as shown in Table 9.1. The c 
categorical responses are usually ordered, whereas the r populations are not. Suppose 
there are $m_i$ subjects in population i and each subject generates a multinomial 
response falling into one of c ordered categories with respective multinomial 
probabilities of $\Pi_i = (\pi_{i1}, \pi_{i2}, \ldots, \pi_{ic})$ for i = 1, 2, ..., r. 

The null hypothesis is 

\[
H_0 : \Pi_1 = \Pi_2 = \cdots = \Pi_r
\qquad \text{(Equation 11.1)}
\]

The alternative hypothesis is that at least one set of multinomial probabilities is 
stochastically larger than at least one other set of multinomial probabilities. Specifically, for 
i = 1, 2, ..., r, let 

\[
\Upsilon_{ij} = \sum_{k=1}^{j} \pi_{ik}
\]

The Kruskal-Wallis test is especially suited to detecting departures from the null 
hypothesis of the form 

\[
H_1 : \text{for at least one } (i_1, i_2) \text{ pair, } \Upsilon_{i_1 j} \ge \Upsilon_{i_2 j}, \quad j = 1, 2, \ldots, c
\qquad \text{(Equation 11.2)}
\]

with strict inequality for at least one j. In other words, you want to reject $H_0$ when at 
least one of the populations is more responsive than the others. 


Tumor Regression Rates Data 


The tumor regression rates of five chemotherapy regimens, Cytoxan (CTX) alone, 
Cyclohexyl-chloroethyl nitrosourea (CCNU) alone, Methotrexate (MTX) alone, 
CTX+CCNU, and CTX+CCNU+MTX were compared in a small clinical trial. Tumor 
regression was measured on a three-point scale: no response, partial response, or 
complete response. The crosstabulation of the results is shown in Figure 11.1. 


Figure 11.1 Crosstabulation of tumor regression data 

Chemotherapy Regimen * Tumor Regression Crosstabulation 

Count 
                                  Tumor Regression 
Chemotherapy Regimen    No Response   Partial Response   Complete Response 
CTX                     2 
CCNU                    1             1 
MTX                     3 
CTX+CCNU                2             2 
CTX+CCNU+MTX            1             1                  4 

Although Figure 11.1 shows the data in crosstabulated format to illustrate the concept 
of applying the Kruskal-Wallis test to singly ordered tables, this test is obtained from the 
Nonparametric Tests procedure, and your data must be structured appropriately for 
Nonparametric Tests. Figure 11.2 shows these data displayed in the Data Editor. The data 
consist of two variables. Chemo is a grouping variable that indicates the chemotherapy 
regimen, and regressn is an ordered categorical variable with three values, where 1=No 
Response, 2=Partial Response, and 3=Complete Response. Note that although variable 
labels are displayed, these variables must be numeric. 


Figure 11.2 Tumor regression data displayed in the Data Editor 

chemo            regressn 
CTX              No Response 
CTX              No Response 
CCNU             No Response 
CCNU             Partial Response 
MTX              No Response 
MTX              No Response 
MTX              No Response 
CTX+CCNU         No Response 
CTX+CCNU         No Response 
CTX+CCNU         Partial Response 
CTX+CCNU         Partial Response 
CTX+CCNU+MTX     No Response 
CTX+CCNU+MTX     Partial Response 
CTX+CCNU+MTX     Complete Response 
CTX+CCNU+MTX     Complete Response 
CTX+CCNU+MTX     Complete Response 
CTX+CCNU+MTX     Complete Response 


Small pilot studies like this one are frequently conducted as a preliminary step to 
planning a large-scale randomized clinical trial. The test in this section may be used to 
determine whether or not the five drug regimens are significantly different with respect 
to their tumor regression rates. Notice how appropriate the alternative hypothesis, 
shown in Equation 11.2, is for this situation. It can be used to detect departures from the 
null hypothesis in which one or more drugs shift the responses from no response to 
partial or complete responses. The results of the Kruskal-Wallis test are shown in Figure 
11.3. 

Figure 11.3 Results of Kruskal-Wallis test for tumor regression data 

Ranks 

                    Chemotherapy Regimen    N     Mean Rank 
Tumor Regression    CTX                     2     5.00 
                    CCNU                    2     8.25 
                    MTX                     3     5.00 
                    CTX+CCNU                4     8.25 
                    CTX+CCNU+MTX            6     13.08 
                    Total                   17 

Test Statistics (1, 2) 

                    Chi-Square   df   Asymp. Sig.   Exact Sig.   Point Probability 
Tumor Regression    8.682        4    .070          .039         .001 

1. Kruskal Wallis Test 
2. Grouping Variable: Chemotherapy Regimen 

The observed value of the test statistic T, calculated by Equation 8.34, is 8.682. The 
asymptotic two-sided p value is based on the chi-square distribution with four degrees 
of freedom. The asymptotic p value is obtained as the area under the chi-square density 
function to the right of 8.682. This p value is 0.070. However, this p value is not reliable 
because of the sparseness of the observed contingency table. 
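The reported statistic and asymptotic p value can be reproduced from the counts listed in Figure 11.2. The sketch below is an illustration under stated assumptions: it uses the standard Kruskal-Wallis formula with midranks and the usual correction for ties (Equation 8.34 is not reproduced in this section, so agreement with it is inferred from the matching result), and it uses the closed-form chi-square tail for four degrees of freedom. The function names are hypothetical.

```python
from math import exp

def kruskal_wallis(table):
    """Tie-corrected Kruskal-Wallis statistic for an r x c table of counts.

    table[i][j] is the number of subjects in group i with ordered response j.
    """
    group_sizes = [sum(row) for row in table]
    tie_sizes = [sum(col) for col in zip(*table)]
    n = sum(group_sizes)
    # Midrank shared by all observations in each ordered category.
    midranks, start = [], 0
    for t in tie_sizes:
        midranks.append(start + (t + 1) / 2)
        start += t
    rank_sums = [sum(v * mr for v, mr in zip(row, midranks)) for row in table]
    h = (12.0 / (n * (n + 1))
         * sum(rs * rs / k for rs, k in zip(rank_sums, group_sizes))
         - 3 * (n + 1))
    tie_correction = 1 - sum(t ** 3 - t for t in tie_sizes) / (n ** 3 - n)
    return h / tie_correction

def chi2_upper_tail_df4(x):
    """Pr(chi-square with 4 degrees of freedom >= x)."""
    return exp(-x / 2) * (1 + x / 2)

# Counts from Figure 11.2: columns are No, Partial, Complete response.
tumor = [[2, 0, 0],    # CTX
         [1, 1, 0],    # CCNU
         [3, 0, 0],    # MTX
         [2, 2, 0],    # CTX+CCNU
         [1, 1, 4]]    # CTX+CCNU+MTX
statistic = kruskal_wallis(tumor)
print(round(statistic, 2), round(chi2_upper_tail_df4(statistic), 3))
```

The midranks computed here (5, 11.5, and 15.5 for the three response categories) also account for the mean ranks listed in Figure 11.3.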

The exact p value is defined by Equation 8.7 as the permutational probability 
Pr(T ≥ 8.682 | y ∈ Γ). The exact p value is 0.039, which implies that there is a 
statistically significant difference between the five modes of chemotherapy. The 
asymptotic inference failed to demonstrate this. Below the exact p value is the point 
probability Pr(T = 8.682). This probability, 0.001, is a natural measure of the 
discreteness of the test statistic. Some statisticians recommend subtracting half of its 
value from the exact p value, in order to yield a less conservative mid-p value. (For more 
information on the role of the mid-p method in exact inference, see Lancaster, 1961; Pratt 
and Gibbons, 1981; and Miettinen, 1985.) 
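For the values reported in Figure 11.3, the mid-p adjustment amounts to the following arithmetic (the result is limited by the three-decimal rounding of the reported values):

```latex
\text{mid-}p = 0.039 - \tfrac{1}{2}(0.001) = 0.0385
```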

Sometimes the data set is too large for an exact analysis, and the Monte Carlo method 
must be used instead. Figure 11.4 shows an unbiased estimate of the exact p value for 
the Kruskal-Wallis test based on a crude Monte Carlo sample of 10,000 tables from the 
reference set. 

Figure 11.4 Monte Carlo results for tumor regression data 

Test Statistics (1, 2) 

                                                     Monte Carlo Sig. 
                                                              99% Confidence Interval 
                    Chi-Square   df   Asymp. Sig.   Sig.      Lower Bound   Upper Bound 
Tumor Regression    8.682        4    .070          .043³     .037          .048 

1. Kruskal Wallis Test 
2. Grouping Variable: Chemotherapy Regimen 
3. Based on 10000 sampled tables with starting seed 20000000. 

The Monte Carlo point estimate is 0.043, which is close to the exact p 
value of 0.039. Moreover, the Monte Carlo method also produces a confidence interval 
for the exact p value. Thus, although this point estimate might change slightly if you 
resample with a different starting seed or a different random number generator, you can 
be 99% confident that the exact p value is contained in the interval 0.037 to 0.048. More 
tables could be sampled from the reference set to further narrow the width of this 
interval. Based on this analysis, it is evident that the Monte Carlo approach leads to the 
same conclusion as the exact approach, demonstrating that there is indeed a significant 
row-by-column interaction in this contingency table. The asymptotic inference 
produced a p value of 0.070, and thus failed to demonstrate a statistically significant 
row-by-column interaction. 


Doubly Ordered R x C 
Contingency Tables 


The tests in this chapter are applicable to r x c contingency tables in which both the 
rows and columns are ordered. A typical example would be an r x c table obtained 
from a dose-response study. Here the rows (r) represent progressively increasing doses 
of some drug, and the columns (c) represent progressively worsening levels of drug 
toxicity. The goal is to test the null hypothesis that the response rates are the same at all 
dose levels. The tests in this chapter exploit the double ordering so as to have good 
power against alternative hypotheses in which an increase in the dose level leads to an 
increase in the toxicity level. 


Available Tests 


Exact Tests offers two tests for doubly ordered r x c contingency tables: the
Jonckheere-Terpstra test and the linear-by-linear association test. Asymptotically,
both test statistics converge to the standard normal distribution or, equivalently, the
squares of these statistics converge to the chi-square distribution with one degree of
freedom. Both the exact and asymptotic p values are available from Exact Tests. The
asymptotic p value is provided by default, while the exact p value must be specifically
requested. If a data set is too large for the exact p value to be computed, Exact Tests
offers a special option whereby the exact p value is estimated up to Monte Carlo
accuracy. Although the logic of the Jonckheere-Terpstra test can be applied to doubly
ordered contingency tables, this test is performed through the Nonparametric Tests: Tests
for Several Independent Samples procedure. Table 12.1 shows the two available tests,
the procedure from which each can be obtained, and a bibliographical reference to each
test.


Table 12.1 Available tests

Test                                 Procedure                                     Reference
Jonckheere-Terpstra test             Nonparametric Tests: K Independent Samples    Lehmann (1973)
Linear-by-linear association test    Crosstabs                                     Agresti (1990)

In this chapter, the null and alternative hypotheses for these tests are specified,
appropriate test statistics are defined, and each test is illustrated with a data set.


When to Use Each Test 


The Jonckheere-Terpstra and linear-by-linear association tests, while not asymptotically 
equivalent, are competitors for testing row and column interaction in a doubly ordered 
r x c table. There has been no formal statistical research on which test has greater
power. Historically, the Jonckheere-Terpstra test was developed for testing continuous 
data in a nonparametric setting, while the linear-by-linear association test was used for 
testing categorical data in a loglinear models setting. However, either test is applicable 
for computing p values in r x c contingency tables as long as both the rows and columns
have a natural ordering. In this chapter, the Jonckheere-Terpstra test is applied to ordinal 
categorical data. See Chapter 8 for a discussion of using this test for continuous data. 
The linear-by-linear association test has some additional flexibility in weighting the 
ordering and in weighting the relative importance of successive rows or columns of the 
contingency table through a suitable choice of row and column scores. This flexibility 
is illustrated in the treatment of the numerical example in “Linear-by-Linear Association 
Test” on p. 165. 


Statistical Methods 


Suppose that each response must fall into one of c ordinal categories according to a
multinomial distribution. Let $m_i$ responses from population i fall into the c ordinal
categories with respective multinomial probabilities of

$$\Pi_i = (\pi_{i1}, \pi_{i2}, \ldots, \pi_{ic})$$

for i = 1, 2, ..., r. The null hypothesis is

$$H_0: \Pi_1 = \Pi_2 = \cdots = \Pi_r$$    Equation 12.1

To specify the alternative hypothesis, define

$$\gamma_{ij} = \sum_{l=1}^{j} \pi_{il}$$    Equation 12.2

for i = 1, 2, ..., r. Since the rows are ordered, it is possible to define one-sided
alternative hypotheses of the form

$$H_1: \gamma_{1j} \le \gamma_{2j} \le \cdots \le \gamma_{rj}$$    Equation 12.3

or

$$H_1': \gamma_{1j} \ge \gamma_{2j} \ge \cdots \ge \gamma_{rj}$$    Equation 12.4

for j = 1, 2, ..., c, with strict inequality for at least one j. Both the Jonckheere-Terpstra
and the linear-by-linear association tests are particularly appropriate for detecting
departures from the null hypothesis of the form $H_1$ or $H_1'$, or for detecting the two-sided
alternative hypothesis that either $H_1$ or $H_1'$ is true. Hypothesis $H_1$ implies that as you
move from row i to row (i + 1), the probability of the response falling in category
(j + 1) rather than in category j increases. Hypothesis $H_1'$ states the opposite, that as
you move down a row, the probability of falling into the next higher category decreases.
The test statistics for the Jonckheere-Terpstra and the linear-by-linear association tests
are so defined that large positive values reject $H_0$ in favor of $H_1$, while large negative
values reject $H_0$ in favor of $H_1'$.


Dose-Response Data 


Patients were treated with a drug at four dose levels (100mg, 200mg, 300mg, 400mg) 
and then monitored for toxicity. The data are tabulated in Figure 12.1. 


Figure 12.1 Crosstabulation of dose-response data

Drug Dose * TOXICITY Crosstabulation (Count)


Notice that there is a natural ordering across both the rows and the columns of the above 
4 x 4 contingency table. There is also the suggestion that progressively increasing drug
doses lead to increases in drug toxicity.


Jonckheere-Terpstra Test


Figure 12.1 shows the data in crosstabulated format to illustrate the concept of applying
the Jonckheere-Terpstra test to doubly ordered tables. However, this test is obtained from
the Nonparametric Tests procedure, and your data must be structured appropriately for
Nonparametric Tests. Figure 12.2 shows a portion of these data displayed in the Data
Editor. The data consist of two variables. Dose is an ordered grouping variable that 
indicates dose level, and toxicity is an ordered categorical variable with four values, 
where 1=Mild, 2=Moderate, 3=Severe, and 4=Death. Note that although value labels 
are displayed, these variables must be numeric. This is a large data set, with 227 cases, 
and therefore Figure 12.2 shows only a small subset of these data in order to illustrate 
the necessary data structure for the Jonckheere-Terpstra test. The full data set was used 
in the following example. 


Figure 12.2 Dose-response data, displayed in the Data Editor 


dose toxicity 
100 mg Mild 
100 mg Mild 
200 mg Severe 
100 mg Mild 
400 mg Moderate 
400 mg Mild 
100 mg Mild 
100 mg Mild 
300 mg Mild 
300 mg Severe 
200 mg Mild 
100 mg Mild 
100 mg Mild 
100 mg Moderate 
100 mg Mild 
400 mg Mild 
400 mg Mild 


You can run the Jonckheere-Terpstra test on the dose-response data shown in Figure 
12.2. The results are shown in Figure 12.3.

Figure 12.3 Results of Jonckheere-Terpstra test for dose-response data

Jonckheere-Terpstra Test(1)

TOXICITY
  Number of Levels in Drug Dose        4
  N                                    227
  Observed J-T Statistic               9127.000
  Mean J-T Statistic                   8827.500
  Std. Deviation of J-T Statistic      181.760
  Std. J-T Statistic                   1.648
  Asymp. Sig. (2-tailed)               .099
  Exact Significance (2-tailed)        .100
  Exact Sig. (1-tailed)                .049
  Point Probability                    .000

1. Grouping Variable: Drug Dose


The value of the observed test statistic, defined by Equation 8.38, is t = 9127, the mean
is E(T) = 8828, the standard deviation is 181.8, and the standardized test statistic,
calculated by Equation 8.41, is t* = 1.65. The standardized statistic is normally distributed
with a mean of 0 and a variance of 1, while its square is chi-square distributed with one
degree of freedom.

The asymptotic two-sided p values are evaluated as the tail areas under a standard 
normal distribution. In calculating the one-sided p value, which is not displayed in the 
output, a choice must be made as to whether to select the left tail or the right tail at the 
observed value t* = 1.65. In Exact Tests, this decision is made by selecting the tail
area with the smaller probability. Thus, the asymptotic one-sided p value is calculated as 


$$p_1 = \min\{\Phi(t^*),\ 1 - \Phi(t^*)\}$$    Equation 12.5

where $\Phi(z)$ is the tail area from $-\infty$ to z under a standard normal distribution. In the
present example, it is the right tail area that is the smaller of the two, so that the
asymptotic one-sided p value is evaluated as the normal approximation to $\Pr(T^* \ge 1.65)$,
which works out to 0.0497. The asymptotic two-sided p value is defined as double the
one-sided:

$$p_2 = 2p_1 = 0.0994$$    Equation 12.6

Since the square of a standard normal variate is a chi-square variate with one degree of
freedom, an equivalent alternative way to compute the asymptotic two-sided p value is
to evaluate the tail area to the right of $(1.65)^2$ from a chi-square distribution with one
degree of freedom. It is easy to verify that this too will yield 0.099 as the asymptotic
two-sided p value.
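
The asymptotic calculation in Equations 12.5 and 12.6 can be checked with a few lines of Python. This is only a sketch: the three input values are those reported in Figure 12.3, while the helper name std_normal_cdf is introduced here for illustration and is not part of Exact Tests.

```python
import math

def std_normal_cdf(z: float) -> float:
    """Phi(z): area under the standard normal curve from -infinity to z."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

# Values reported in Figure 12.3.
observed, mean, std_dev = 9127.0, 8827.5, 181.760

t_star = (observed - mean) / std_dev
p_one_sided = min(std_normal_cdf(t_star), 1.0 - std_normal_cdf(t_star))  # Equation 12.5
p_two_sided = 2.0 * p_one_sided                                           # Equation 12.6

# Chi-square (1 df) right tail at t*^2; for 1 df this equals erfc(|t*| / sqrt(2)).
p_chi_square = math.erfc(abs(t_star) / math.sqrt(2.0))

print(f"t* = {t_star:.3f}")
print(f"one-sided p = {p_one_sided:.4f}, two-sided p = {p_two_sided:.4f}")
print(f"chi-square tail = {p_chi_square:.4f}")
```

The last two printed values agree because doubling a normal tail area and taking the chi-square tail at the squared statistic are the same calculation.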

The exact one-sided p value is computed as the smaller of two permutational
probabilities:

$$p_1 = \min\{\Pr(T^* \le 1.65),\ \Pr(T^* \ge 1.65)\}$$    Equation 12.7

In the present example, the smaller permutational probability is the one that evaluates
the right tail. It is displayed on the screen as $\Pr(T^* \ge 1.65) = 0.049$. Also displayed
is the point probability $\Pr(T^* = 1.65)$. This probability, 0.000, is a natural
measure of the discreteness of the test statistic. Some statisticians advocate subtracting
half its value from the exact p value, thereby yielding a less conservative mid-p value.
(See Lancaster, 1961; Pratt and Gibbons, 1981; and Miettinen, 1985 for more
information on the role of the mid-p value in exact inference.) Equation 12.8 defines the
exact two-sided p value:

$$p_2 = \Pr(|T^*| \ge 1.648) = 0.100$$    Equation 12.8

Notice that this definition will produce the same answer as Equation 9.4, with
$D(y) = (T^*(y))^2$ for all $y \in \Gamma$.

Sometimes the data set is too large for an exact analysis, and the Monte Carlo method 
must be used instead. Figure 12.4 displays an unbiased estimate of the exact one- and 
two-sided p values for the Jonckheere-Terpstra test based on a crude Monte Carlo sample
of 10,000 tables from the reference set. 


Figure 12.4 Monte Carlo results for Jonckheere-Terpstra test for dose-response data

Jonckheere-Terpstra Test(1)

TOXICITY
  Number of Levels in Drug Dose                  4
  N                                              227
  Observed J-T Statistic                         9127.000
  Mean J-T Statistic                             8827.500
  Std. Deviation of J-T Statistic                181.760
  Std. J-T Statistic                             1.648
  Asymp. Sig. (2-tailed)                         .099
  Monte Carlo Sig. (2-tailed)
    Sig.                                         .101(2)
    99% Confidence Interval, Lower Bound         .093
    99% Confidence Interval, Upper Bound         .109
  Monte Carlo Sig. (1-tailed)
    Sig.                                         .051(2)
    99% Confidence Interval, Lower Bound         .045
    99% Confidence Interval, Upper Bound         .057

1. Grouping Variable: Drug Dose
2. Based on 10000 sampled tables with starting seed 2000000.


The Monte Carlo point estimate of the exact one-sided p value is 0.051, which is very 
close to the exact one-sided p value of 0.049. Moreover, the Monte Carlo method also 
produces a confidence interval for the exact p value. Thus, although this point estimate 
might change slightly if you resample with a different starting seed or a different random 
number generator, you can be 99% confident that the exact p value is contained in the 
interval 0.045 to 0.057. The Monte Carlo point estimate of the exact two-sided p value 
is 0.101, and the corresponding 99% confidence interval is 0.093 to 0.109. More tables 
could be sampled from the reference set to further narrow the widths of these intervals. 
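
The confidence bounds in Figure 12.4 can be approximated from the point estimates alone. The sketch below assumes the interval is the usual normal-approximation interval for a binomial proportion; the interval formula the manual itself refers to (Equation 9.8) is not reproduced in this chapter, so treat the formula and the function name monte_carlo_ci as assumptions made for illustration.

```python
import math
from statistics import NormalDist

def monte_carlo_ci(p_hat: float, n_samples: int, level: float = 0.99):
    """Normal-approximation confidence interval for a Monte Carlo p value estimate."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)          # about 2.576 for 99%
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n_samples)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

# Point estimates reported in Figure 12.4 (10,000 sampled tables).
one_sided = monte_carlo_ci(0.051, 10_000)
two_sided = monte_carlo_ci(0.101, 10_000)
print("one-sided:", [round(b, 3) for b in one_sided])
print("two-sided:", [round(b, 3) for b in two_sided])
```

Under this approximation the half-width shrinks with the square root of the number of sampled tables, so roughly four times as many tables are needed to halve the width of an interval.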
Linear-by-Linear Association Test


The linear-by-linear association test orders the tables in $\Gamma$ according to the linear rank
statistic. Thus, if the observed table is x, the unnormalized test statistic is

$$LL(x) = \sum_{i=1}^{r} \sum_{j=1}^{c} u_i v_j x_{ij}$$    Equation 12.9

where $u_i$, i = 1, 2, ..., r, are arbitrary row scores, and $v_j$, j = 1, 2, ..., c, are arbitrary
column scores. Under the null hypothesis of no row-by-column interaction, the
linear-by-linear statistic has a mean of

$$E(LL(X)) = N^{-1} \left( \sum_{i=1}^{r} u_i m_i \right) \left( \sum_{j=1}^{c} v_j n_j \right)$$    Equation 12.10

and a variance of

$$\mathrm{var}(LL(X)) = (N-1)^{-1} \left[ \sum_{i} u_i^2 m_i - \frac{\left( \sum_{i} u_i m_i \right)^2}{N} \right] \left[ \sum_{j} v_j^2 n_j - \frac{\left( \sum_{j} v_j n_j \right)^2}{N} \right]$$    Equation 12.11

See Agresti (1990) for more information. The asymptotic distribution of

$$LL^*(X) = \frac{LL(X) - E(LL(X))}{\sqrt{\mathrm{var}(LL(X))}}$$    Equation 12.12

is normal, with a mean of 0 and a variance of 1, where LL* denotes the standardized
version of LL. The square of the normalized statistic is distributed as chi-square with one
degree of freedom.

Next, run the linear-by-linear association test on the dose-response data shown in 
Figure 12.1. The results are shown in Figure 12.5.

Figure 12.5 Results of linear-by-linear association test

Chi-Square Tests

                                                 Asymp. Sig.   Exact Sig.   Exact Sig.   Point
                                Value      df    (2-tailed)    (2-tailed)   (1-tailed)   Probability
Linear-by-Linear Association    3.264(2)   1     .071          .079         .044         .012

2. Standardized stat. is 1.807 ...


The upper portion of the output displays the asymptotic two-sided p value. The p values 
are evaluated as tail areas under a chi-square distribution. The standardized value for the 
linear-by-linear association test is LL* = 1.807. This value is normally distributed with
a mean of 0 and a variance of 1. The chi-square value, 3.264, is the square of this 
standardized value. The asymptotic two-sided p value is calculated under a chi-square 
distribution. 
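
The standardized statistic and its chi-square value can be reproduced from Equations 12.9 through 12.12. The following is a sketch under stated assumptions: the function name linear_by_linear is hypothetical, and the cell counts are reconstructed from the partially legible dose-response crosstabulation (N = 227, empty cells taken as 0), so they should be checked against the original data before being relied on. Rows are scored 1 to 4 rather than 100 to 400; the standardized statistic does not change under a linear rescaling of the scores.

```python
import math

def linear_by_linear(table, row_scores, col_scores):
    """Return (LL, mean, variance, standardized LL*) per Equations 12.9 to 12.12."""
    row_totals = [sum(row) for row in table]                 # m_i
    col_totals = [sum(col) for col in zip(*table)]           # n_j
    n = sum(row_totals)                                      # N

    ll = sum(u * v * x
             for u, row in zip(row_scores, table)
             for v, x in zip(col_scores, row))               # Equation 12.9

    sum_um = sum(u * m for u, m in zip(row_scores, row_totals))
    sum_vn = sum(v * c for v, c in zip(col_scores, col_totals))
    mean = sum_um * sum_vn / n                               # Equation 12.10

    ss_u = sum(u * u * m for u, m in zip(row_scores, row_totals)) - sum_um ** 2 / n
    ss_v = sum(v * v * c for v, c in zip(col_scores, col_totals)) - sum_vn ** 2 / n
    variance = ss_u * ss_v / (n - 1)                         # Equation 12.11

    return ll, mean, variance, (ll - mean) / math.sqrt(variance)   # Equation 12.12

# ASSUMPTION: counts reconstructed from the dose-response crosstabulation
# (rows 100 to 400 mg, columns Mild, Moderate, Severe, Death); blanks taken as 0.
dose_table = [
    [100, 1, 0, 0],
    [18, 1, 1, 0],
    [50, 1, 1, 1],
    [50, 1, 1, 1],
]
ll, mean, variance, ll_star = linear_by_linear(dose_table, [1, 2, 3, 4], [1, 2, 3, 4])
p_two_sided = math.erfc(abs(ll_star) / math.sqrt(2.0))  # chi-square (1 df) tail at LL*^2
print(f"LL = {ll}, LL* = {ll_star:.3f}, LL*^2 = {ll_star ** 2:.3f}, p = {p_two_sided:.3f}")
```

Because the scores are ordinary arguments, the same function can be called with unequally spaced column scores to see how the choice of scores changes the statistic.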

The exact one- and two-sided p values are also displayed in the output. The exact
one-sided p value is computed as the smaller of two permutational probabilities:

$$p_1 = \min\{\Pr(LL^*(y) \le 1.807 \mid y \in \Gamma),\ \Pr(LL^*(y) \ge 1.807 \mid y \in \Gamma)\}$$    Equation 12.13

In the present example, the smaller permutational probability is the one that evaluates
the right tail. This value is 0.044. Also displayed is the point probability
$\Pr(LL^*(X) = 1.807)$. This probability, 0.012, is a natural measure of the discreteness
of the test statistic. Some statisticians advocate subtracting half its value from the exact
p value, thereby yielding a less conservative mid-p value. (For more information on the
role of the mid-p method in exact inference, see Lancaster, 1961; Pratt and Gibbons,
1981; and Miettinen, 1985.) Equation 12.14 defines the exact two-sided p value:

$$p_2 = \Pr(|LL^*(X)| \ge 1.807) = 0.0792$$    Equation 12.14

Notice that this definition will produce the same answer as Equation 9.4, with
$D(y) = (LL^*(y))^2$ for all $y \in \Gamma$.
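
As a worked illustration of the mid-p adjustment, take the one-sided values reported for this example (exact one-sided p value 0.044, point probability 0.012). The symbol $p_{\text{mid}}$ is introduced here only for illustration and does not appear in the output:

```latex
p_{\text{mid}} = p_1 - \tfrac{1}{2}\Pr(LL^*(X) = 1.807) = 0.044 - \tfrac{1}{2}(0.012) = 0.038
```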

Sometimes the data set is too large for an exact analysis, and the Monte Carlo method 
must be used instead. Figure 12.6 displays an unbiased estimate of the exact one- and 
two-sided p values for the linear-by-linear association test based on a crude Monte Carlo 
sample of 10,000 tables from the reference set.

Figure 12.6 Monte Carlo results for linear-by-linear association test

Chi-Square Tests

Linear-by-Linear Association
  Value                                          3.264(3)
  df                                             1
  Asymp. Sig. (2-tailed)                         .071
  Monte Carlo Significance (2-tailed)
    Sig.                                         .081(2)
    99% Confidence Interval, Lower Bound         .073
    99% Confidence Interval, Upper Bound         .088
  Monte Carlo Significance (1-tailed)
    Sig.                                         .046(2)
    99% Confidence Interval, Lower Bound         .040
    99% Confidence Interval, Upper Bound         .051

2. Based on 10000 and seed 2000000 ...
3. Standardized stat. is 1.807 ...


The Monte Carlo point estimate of the exact one-sided p value is 0.046, which is very 
close to the exact one-sided p value of 0.044. Moreover, the Monte Carlo method also 
produces a confidence interval for the exact p value. Thus, although this point estimate 
might change slightly if you resample with a different starting seed or a different random 
number generator, you can be 99% confident that the exact p value is contained in the 
interval 0.040 to 0.051. The Monte Carlo point estimate of the exact two-sided p value 
is 0.081, and the corresponding 99% confidence interval is 0.073 to 0.088. More tables 
could be sampled from the reference set to further narrow the widths of these intervals. 

One important advantage of the linear-by-linear association test over the
Jonckheere-Terpstra test is that it lets you specify arbitrary row and column scores.
Suppose, for example, that you want to penalize the greater toxicity levels by greater
amounts through the unequally spaced scores (1, 3, 9, 27). The crosstabulation of the new
data is shown in Figure 12.7.


Figure 12.7 Drug dose data penalized at greater toxicity levels

Drug Dose * TOXICITY Crosstabulation

Count
                       TOXICITY
                 Mild   Moderate   Severe   Death
                 1      3          9        27
Drug    100 mg   100    1
Dose    200 mg   18     1          1
        300 mg   50     1          1        1
        400 mg   50     1          1        1

Figure 12.8 shows the results of the linear-by-linear association test on these scores.

Figure 12.8 Results of linear-by-linear association test on adjusted data

Chi-Square Tests

                                                 Asymp. Sig.   Exact Sig.   Exact Sig.   Point
                                Value      df    (2-tailed)    (2-tailed)   (1-tailed)   Probability
Linear-by-Linear Association    3.008(2)   1     .083          .078         .050         .005

2. Standardized stat. is 1.734 ...


Observe now that the one-sided asymptotic p value is 0.042 (that is, 0.083/2), which is
statistically significant, but that the one-sided exact p value (0.050) is not statistically
significant at the 5% level. Inference based on asymptotic theory, with a rigid 5% criterion for
claiming statistical significance, would therefore lead to an incorrect conclusion.


Measures of Association 


This chapter introduces some definitions and notation needed to estimate, test, and 
interpret the various measures of association computed by Exact Tests. The methods 
discussed here provide the necessary background for the statistical procedures 
described in Chapter 14, Chapter 15, and Chapter 16. 

Technically, there is a distinction between an actual measure of association, regarded 
as a population parameter, and its estimate from a finite sample. For example, the 
correlation coefficient $\rho$ is a population parameter in a bivariate normal distribution,
whereas Pearson’s product moment coefficient R is an estimate of $\rho$, based on a finite
sample from this distribution. However, in this chapter, the term “measure of association” 
will be used to refer to either a population parameter or an estimate from a finite sample, 
and it will be clear from the context which is intended. In particular, the formulas for the 
various measures of association discussed in this chapter refer to sample estimates and 
their associated standard errors, not to underlying population parameters. Formulas are 
not provided for the actual population parameters. For each measure of association, the 
following statistics are provided: 


• A point estimate for the measure of association (most often this will be the
maximum-likelihood estimate [MLE]).

• Its asymptotic standard error, evaluated at the maximum-likelihood estimate
(ASE1).

• Asymptotic two-sided p values for testing the null hypothesis that the measure of
association is 0.

• Exact two-sided p values (possibly up to Monte Carlo accuracy) for testing the null
hypothesis that the measure of association is 0.


Representing Data in Crosstabular Form 


All of the measures of association considered in this book are defined from data that 
can be represented in the form of the r x c contingency table, as shown in Table 13.1.

Table 13.1 Observed r x c contingency table

                        Column Number
Row Number     Col_1    Col_2    ...    Col_c    Row Totals    Row Scores
Row_1          x_11     x_12     ...    x_1c     m_1           u_1
Row_2          x_21     x_22     ...    x_2c     m_2           u_2
...            ...      ...      ...    ...      ...           ...
Row_r          x_r1     x_r2     ...    x_rc     m_r           u_r
Col_Totals     n_1      n_2      ...    n_c      N
Col_Scores     v_1      v_2      ...    v_c

This r x c table is formed from N observations cross-classified into row categories (r)
and column categories (c), with $x_{ij}$ of the observations falling into row category i and
column category j. Such a table is appropriate for categorical data. For example, the row
classification might consist of three discrete age categories (young, middle-aged, and
elderly), and the column classification might consist of three discrete annual income
categories ($25,000-50,000, $50,000-75,000, and $75,000-100,000). These are examples
of ordered categories. Alternatively, one or both of the discrete categories might be
nominal. For example, the row classification might consist of three cities (Boston, New
York, and Philadelphia). In this chapter, you will define various measures of association
based on crosstabulations such as the one shown in Table 13.1.

Measures of association are also defined on data sets generated from continuous 
bivariate distributions. Although such data sets are not naturally represented as 
crosstabulations, it is nevertheless convenient to create artificial crosstabulations from 
them in order to present one unified method of defining and computing measures of 
association. To see this, let A, B represent a pair of random variables following a
bivariate distribution, and let $\{(a_1, b_1), (a_2, b_2), \ldots, (a_N, b_N)\}$ be N pairs of observations
drawn from this bivariate distribution. The data may contain ties. Moreover, the
original data might be replaced by rank scores. To accommodate these possibilities, let
$(u_1 < u_2 < \cdots < u_r)$ be r distinct scores assumed by the A component of the data series,
sorted in ascending order. The $u_i$’s might represent the raw data, the data replaced by
ranks, or the raw data replaced by arbitrary scores. When there are no ties, r will equal
N. Similarly, let $(v_1 < v_2 < \cdots < v_c)$ be c distinct scores assumed by the B component
of the data series. Now the bivariate data can be cross-classified into an r x c
contingency table such as Table 13.1, with $u_i$ as the score for row i and $v_j$ as the score
for column j.
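
The cross-classification just described can be sketched in a few lines of Python. The function name crosstabulate and the sample pairs are hypothetical and chosen only to show how ties reduce the table below N x N; they are not taken from the manual's data.

```python
from collections import Counter

def crosstabulate(pairs):
    """Cross-classify (a, b) pairs into an r x c table of counts.

    Returns (row_scores, col_scores, table), where the scores are the distinct
    observed values of each component sorted in ascending order.
    """
    row_scores = sorted({a for a, _ in pairs})      # u_1 < u_2 < ... < u_r
    col_scores = sorted({b for _, b in pairs})      # v_1 < v_2 < ... < v_c
    counts = Counter(pairs)
    table = [[counts[(u, v)] for v in col_scores] for u in row_scores]
    return row_scores, col_scores, table

# Hypothetical data with ties (illustration only; not taken from the manual).
pairs = [(82, 42), (98, 46), (82, 46), (40, 37), (98, 46), (113, 88)]
u, v, x = crosstabulate(pairs)
row_totals = [sum(row) for row in x]                # m_i
col_totals = [sum(col) for col in zip(*x)]          # n_j
print(u, v)
for row in x:
    print(row)
print(row_totals, col_totals)
```

Six pairs produce a 4 x 4 table here because two values of each component are repeated; with no ties the table would be 6 x 6 with every margin equal to 1.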

For example, consider the bivariate data set shown in Figure 13.1. This data set is 
adapted from Siegel and Castellan (1988) with appropriate alterations to illustrate the 
effect of ties. The original data are shown in Chapter 14. Each subject was measured on 
two scales—authoritarianism and social status striving—and the goal was to estimate 
the correlation between these two measures. Figure 13.1 shows the data displayed in the
Data Editor. Author contains subjects’ measurements on the authoritarianism scale, and
social contains subjects’ measurements on the social status striving scale. Figure 13.2
shows the same data set crosstabulated as a 5 x 5 contingency table.


Figure 13.1 Bivariate data set 


subject author social 


Figure 13.2 Crosstabulation of bivariate data set

Authoritarianism * social status striving Crosstabulation (Count)

Rows: Authoritarianism (40, 82, 87, 111, 113); columns: social status striving

The original data consist of N = 8 pairs of observations. These data are replaced by an
equivalent contingency table. Because these data contain ties, the contingency table is
5 x 5 instead of 8 x 8. Had the data been free of ties, every row and column sum would
have been unity, and the equivalent contingency table would have been 8 x 8. In this
sense, the contingency table is not a natural representation of paired continuous data,
since it can artificially expand N bivariate pairs into an N x N rectangular array.
However, it is convenient to represent the data in this form, since it provides a consistent 
notation for defining all of the measures of association and related statistics that you will 
be estimating.


Point Estimates


Maximum-likelihood theory is used to estimate each measure of association. For this
purpose, Table 13.1 is constructed by taking N samples from a multinomial distribution and
observing counts $x_{ij}$ in cells (i, j) with the probability $\pi_{ij}$, where $\sum_{i,j} \pi_{ij} = 1$. Measures
of association are functions of these cell probabilities. A maximum-likelihood estimate
(MLE) is provided for each measure, along with an asymptotic standard error (ASE1)
evaluated at the MLE. All of the measures of association defined from ordinal data in
Chapter 14 and all of the measures of agreement in Chapter 16 fall in the range of -1 to
+1, with 0 implying that there is no association, -1 implying a perfect negative
association, and +1 implying a perfect positive association.

All of the measures of association defined from nominal data in Chapter 15 fall in
the range of 0 to 1, with 0 implying that there is no association and 1 implying perfect
association.


Exact P Values 


Exact p values are computed by the methods described in Chapter 9. First, the reference
set, $\Gamma$, is defined to be all r x c tables with the same margins as the observed table, as
shown in Equation 9.1. Under the null hypothesis that there is no association, each table
$y \in \Gamma$ has the hypergeometric probability P(y), given by Equation 9.2. Then each
table $y \in \Gamma$ is assigned a value M(y) corresponding to the measure of association being
investigated.


Nominal Data 


For measures of association on nominal data, only two-sided p values are defined. The 
exact two-sided p value is computed by Equation 9.4, with M(y) substituted for D(y).
Thus,

$$p_2 = \sum_{M(y) \ge M(x)} P(y) = \Pr\{M(y) \ge M(x)\}$$    Equation 13.1


Ordinal and Agreement Data 


For measures of association based on ordinal data and for measures of agreement, only
two-sided p values are defined. Now M(y) is a univariate test statistic ranging between
-1 and +1, with a mean of 0. A negative value for M(y) implies a negative association
between the row and column variables, while a positive value implies a positive
association. The exact two-sided p value is obtained by Equation 9.4, with $M^2(y)$
substituted for D(y). Thus,

$$p_2 = \sum_{M^2(y) \ge M^2(x)} P(y) = \Pr\{M^2(y) \ge M^2(x)\}$$    Equation 13.2

An equivalent definition of the two-sided p value is

$$p_2 = \sum_{|M(y)| \ge |M(x)|} P(y) = \Pr\{|M(y)| \ge |M(x)|\}$$    Equation 13.3

This definition expresses the exact two-sided p value as a sum of two exact one-sided p
values, one in the left tail and the other in the right tail of the exact distribution of M(y).
Exact permutational distributions are not usually symmetric, so the areas in the two tails
may not be equal. This is an important distinction between exact and asymptotic p
values. In the latter case, the asymptotic two-sided p value is always double the asymptotic
one-sided p value by the symmetry of the asymptotic normal distribution of M(y).
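
For very small tables, the exact two-sided p value can be obtained by brute-force enumeration of the reference set. The sketch below is not the algorithm used by Exact Tests, which is far more efficient. It uses the standard hypergeometric probability of a table with fixed margins, and the function names and the 2 x 2 example are hypothetical. The measure passed in is the cross-product difference, which for a 2 x 2 table with fixed margins is proportional to the correlation between row and column.

```python
from math import factorial, prod

def tables_with_margins(row_totals, col_totals):
    """Yield every table of non-negative integer counts with the given margins."""
    def fill_row(remaining, cols_left):
        # Split `remaining` across the columns without exceeding any column total.
        if len(cols_left) == 1:
            if remaining <= cols_left[0]:
                yield (remaining,)
            return
        for k in range(min(remaining, cols_left[0]) + 1):
            for rest in fill_row(remaining - k, cols_left[1:]):
                yield (k,) + rest

    def build(rows_left, cols_left):
        if len(rows_left) == 1:
            yield (tuple(cols_left),)        # the last row is forced by the margins
            return
        for row in fill_row(rows_left[0], cols_left):
            new_cols = [c - k for c, k in zip(cols_left, row)]
            for rest in build(rows_left[1:], new_cols):
                yield (row,) + rest

    yield from build(list(row_totals), list(col_totals))

def hypergeometric_probability(table):
    """Probability of a table, given its margins, under no association."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    n = sum(row_totals)
    numerator = prod(factorial(m) for m in row_totals) * prod(factorial(c) for c in col_totals)
    denominator = factorial(n) * prod(factorial(x) for r in table for x in r)
    return numerator / denominator

def exact_two_sided_p(observed, measure):
    """Sum P(y) over all tables y with |M(y)| >= |M(x)| (Equation 13.3)."""
    rows = [sum(r) for r in observed]
    cols = [sum(c) for c in zip(*observed)]
    m_obs = abs(measure(observed))
    return sum(hypergeometric_probability(y)
               for y in tables_with_margins(rows, cols)
               if abs(measure(y)) >= m_obs - 1e-12)

# Hypothetical 2 x 2 example; the measure is the cross-product difference.
observed = ((3, 1), (1, 3))
cross_product = lambda t: t[0][0] * t[1][1] - t[0][1] * t[1][0]
p_value = exact_two_sided_p(observed, cross_product)
print(f"exact two-sided p value = {p_value:.4f}")
```

The number of tables in the reference set grows very quickly with N and with the table dimensions, which is why enumeration of this kind is practical only for toy examples.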


Monte Carlo P Values 


Monte Carlo p values are very close approximations to corresponding exact p values but 
have the advantage that they are much easier to compute. These p values are computed 
by the methods described in Chapter 9 in “Monte Carlo Two-Sided P Values” on p. 143. 
For nominal data, only two-sided p values are defined. The Monte Carlo estimate of the 
exact two-sided p value is obtained by Equation 9.6, with an associated confidence 
interval given by Equation 9.8. In this computation, the critical region $\Gamma^*$ is defined by

$$\Gamma^* = \{y \in \Gamma : M(y) \ge M(x)\}$$    Equation 13.4

For measures of association based on ordinal data and for measures of agreement,
two-sided p values are defined. For two-sided p values,

$$\Gamma^* = \{y \in \Gamma : |M(y)| \ge |M(x)|\}$$    Equation 13.5


Asymptotic P Values 


For measures of association based on nominal data, only two-sided p values are defined. 
These p values are obtained as tail areas of the chi-square distribution with 
(r - 1)(c - 1) degrees of freedom.

For measures of association on ordinal data and for measures of agreement, the asymptotic
standard error of the maximum-likelihood estimate under the null hypothesis (ASE0) is
obtained. Then asymptotic one- and two-sided p values are obtained by using the fact that the
ratio M(x)/ASE0 converges to a standard normal distribution.


Measures of Association for 
Ordinal Data 


Exact Tests provides the following measures of association between pairs of ordinal 
variables: Pearson’s product-moment correlation coefficient, Spearman’s rank-order 
correlation coefficient, Kendall’s tau coefficient, Somers’ d coefficient, and the gamma 
coefficient. All of these measures of association range between -1 and +1, with 0
signifying no association, -1 signifying perfect negative association, and +1 signifying
perfect positive association. One other measure of association mentioned in this chapter 
is Kendall’s W, also known as Kendall’s coefficient of concordance. This test is 
discussed in detail in Chapter 7. 


Available Measures 


Table 14.1 shows the available measures of association, the procedure from which each 
can be obtained, and a bibliographical reference for each test. 


Table 14.1 Available tests

Measure of Association                             Procedure                                                 Reference
Pearson’s product-moment correlation               Crosstabs                                                 Siegel and Castellan (1988)
Spearman’s rank-order correlation                  Crosstabs                                                 Siegel and Castellan (1988)
Kendall’s W                                        Nonparametric Tests: Tests for Several Related Samples   Conover (1975)
Kendall’s tau-b, Kendall’s tau-c, and Somers’ d    Crosstabs                                                 Siegel and Castellan (1988)
Gamma coefficient                                  Crosstabs                                                 Siegel and Castellan (1988)


Pearson’s Product-Moment Correlation Coefficient


Let A and B be a pair of correlated random variables. Suppose you observe N pairs of
observations $\{(a_1, b_1), (a_2, b_2), \ldots, (a_N, b_N)\}$ and crosstabulate them into the r x c
contingency table displayed as Table 13.1, in which the $u_i$’s are the distinct values
assumed by A and the $v_j$’s are the distinct values assumed by B. When the data follow
a bivariate normal distribution, the appropriate measure of association is the correlation
coefficient, $\rho$, between A and B. This parameter is estimated by Pearson’s
product-moment correlation coefficient, shown in Equation 14.1. In this equation, $m_i$ represents
the marginal row total and $n_j$ represents the marginal column total.

$$R = \frac{\sum_{i=1}^{r} \sum_{j=1}^{c} x_{ij} (u_i - \bar{u})(v_j - \bar{v})}{\sqrt{\left[ \sum_{i=1}^{r} m_i (u_i - \bar{u})^2 \right] \left[ \sum_{j=1}^{c} n_j (v_j - \bar{v})^2 \right]}}$$    Equation 14.1

where

$$\bar{u} = \sum_{i=1}^{r} m_i u_i / N \quad \text{and} \quad \bar{v} = \sum_{j=1}^{c} n_j v_j / N$$    Equation 14.2

The formulas for the asymptotic standard errors are fairly complicated. These formulas 
are discussed in the algorithms manual available on the Manuals CD and also available 
by selecting Algorithms on the Help menu. 

Now compute Pearson’s product-moment correlation coefficient for the first
seven pairs of observations of the authoritarianism and social status striving data 
discussed in Siegel and Castellan (1988). The data are shown in Figure 14.1. Author 
contains subjects’ scores on the authoritarianism scale, and social contains subjects’ 
scores on the social status striving scale. 


Figure 14.1 Subset of social status striving data 


subject author social 
1 82 42 
2 98 46 
3 87 39 
4 40 37 
5 116 65 
6 113 88 
7 111 86 


The results are shown in Figure 14.2.

Figure 14.2 Pearson’s product-moment correlation coefficient for subset of social status
striving data

Symmetric Measures

                                                Asymp.                                    Exact
                                      Value     Std. Error   Approx. T   Approx. Sig.    Significance
Interval by Interval    Pearson's R   .739      .054         2.452       .058(1)         .037
N of Valid Cases                      7

1. Based on normal approximation


The correlation coefficient has a point estimate of R = 0.739. The exact two-sided p 
value is 0.037 and indicates that the correlation coefficient is significantly different from 
0. The corresponding asymptotic two-sided p value is 0.058 and fails to demonstrate 
statistical significance at the 5% level for this small data set. 
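
When there are no ties, every table in the reference set corresponds to one pairing of the two variables, and all N! pairings are equally likely under the null hypothesis. The following sketch enumerates them for the seven pairs of Figure 14.1. It assumes tables are ordered by the absolute value of R; because the margins are fixed, R is proportional to the centered cross-product used below, which keeps the comparison in exact integer arithmetic. If this matches the definition used by the procedure, the result should be comparable to the exact significance in Figure 14.2, but that agreement is not guaranteed here.

```python
from itertools import permutations
from math import factorial

author = [82, 98, 87, 40, 116, 113, 111]   # Figure 14.1
social = [42, 46, 39, 37, 65, 88, 86]

n = len(author)
center = sum(author) * sum(social)

def centered_cross_product(b_values):
    # n * sum(a*b) - sum(a) * sum(b): proportional to Pearson's R for fixed margins
    return n * sum(a * b for a, b in zip(author, b_values)) - center

observed = abs(centered_cross_product(social))

# Count pairings whose statistic is at least as extreme as the observed one.
extreme = sum(1 for perm in permutations(social)
              if abs(centered_cross_product(perm)) >= observed)
total = factorial(n)
p_exact = extreme / total
print(f"{extreme} of {total} pairings are at least as extreme; p = {p_exact:.3f}")
```

The loop visits 7! = 5,040 pairings. With 10 observations it would visit about 3.6 million, and with 12 about 479 million, which illustrates why full enumeration becomes impractical so quickly for continuous data.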

It should be noted that the computational limits for exact inference are reached rather 
quickly for Pearson’s product-moment correlation coefficient with continuous data. By 
the time N = 10, the Monte Carlo option should be used rather than the exact option.
Consider, for example, the complete authoritarianism data set of 12 observations (Siegel 
and Castellan, 1988) shown in Figure 14.3. 


Figure 14.3 Complete social status striving data 


subject author social
1 82 42 
2 98 46 
3 87 39 
4 40 37 
5 116 65 
6 113 88 
7 111 86 
8 83 56 
9 85 62 
10 126 92 
11 106 54 
12 117 31 


For this data set, the exact two-sided p value, shown in Figure 14.5, is 0.001, 
approximately half the asymptotic two-sided p value of 0.003. However, it may be time- 
consuming to perform the exact calculation. In contrast, the Monte Carlo p value based 
on 10,000 samples from the data set produces a significance estimate of 0.002, 
practically the same as the exact p value. The 99% confidence interval for the exact p 
value is (0.001, 0.003). The Monte Carlo output is shown in Figure 14.4, and the 
corresponding exact output is shown in Figure 14.5. 


Figure 14.4 Correlations for complete social status striving data using the Monte Carlo 
method 


Symmetric Measures


                                                                            Monte Carlo Sig.
                                                                                    99% Confidence Interval
                                            Asymp.       Approx.   Approx.              Lower   Upper
                                    Value   Std. Error   T         Sig.       Sig.      Bound   Bound
Interval by Interval  Pearson's R   .775    .060         3.872     .003¹      .002²     .001    .003
N of Valid Cases                    12


1. Based on normal approximation.
2. Based on 10000 sampled tables with starting seed of 2000000.


Figure 14.5 Exact results for correlations for complete social status striving data 


Symmetric Measures


                                            Asymp.
                                    Value   Std. Error   Approx. T   Approx. Sig.   Exact Sig.
Interval by Interval  Pearson's R   .775    .060         3.872       .003¹          .001
N of Valid Cases                    12


1. Based on normal approximation.


Spearman’s Rank-Order Correlation Coefficient 


If you are reluctant to make the assumption of bivariate normality, you can use
Spearman’s rank-order correlation coefficient instead of Pearson’s product-moment correlation
coefficient. The only difference between the two measures of association is that Pearson’s
measure uses the raw data, whereas Spearman’s uses ranks derived from the raw data.
Specifically, if the data are represented in the crosstabular form of Table 13.1, Pearson’s
measure uses the raw data as the $u_i$ and $v_j$ scores, while Spearman’s measure uses


$u_i = m_1 + m_2 + \cdots + m_{i-1} + (m_i + 1)/2$        Equation 14.3


for $i = 1, 2, \ldots, r$, and


$v_j = n_1 + n_2 + \cdots + n_{j-1} + (n_j + 1)/2$        Equation 14.4


for $j = 1, 2, \ldots, c$. Once these transformations are made, all of the remaining
calculations for the point estimate (R), the standard error (ASE1), the confidence
interval, the asymptotic p value, and the exact p value are identical to corresponding
ones for Pearson’s product-moment correlation coefficient.

Consider, for example, the data displayed in Figure 13.1. Figure 14.6 displays these 
data with their ranks. Variable rauthor contains the ranks for author, the authoritarianism 
scores, and variable rsocial contains the ranks for social, the social status striving scores. 


Figure 14.6 Raw data and rank scores for eight-case subset of social status striving data 


subject author rauthor social rsocial 
1 82 2.5 46 3.0 
2 87 4.5 46 3.0 
3 87 4.5 39 1.0 
4 40 1.0 56 5.0 
5 111 6.5 65 6.0 
6 113 8.0 88 7.5 
7 111 6.5 88 7.5 
8 82 2.5 46 3.0 

Notice that tied ranks have been replaced by mid-ranks. These same rank scores could
be obtained by crosstabulating author with social, and applying Equation 14.3 and
Equation 14.4. The crosstabulation of the rank scores is shown in Figure 14.7.


Figure 14.7 Crosstabulation of rank scores for eight-case subset of social status striving data 


RANK of AUTHOR * RANK of SOCIAL Crosstabulation 


Count (rows: RANK of AUTHOR; columns: RANK of SOCIAL) 


Figure 14.8 shows the point and interval estimates for Spearman’s correlation
coefficient for these data. The exact and asymptotic p values for testing the null hypothesis
that there is no correlation are also shown.

Figure 14.8 Exact results for Spearman’s correlation coefficient for eight-case subset of social
status striving data


Symmetric Measures


                                                     Asymp.       Approx.   Approx.   Exact
                                             Value   Std. Error   T         Sig.      Sig.
Ordinal by Ordinal  Spearman Correlation     .594    .309         1.808     .121¹     .125
N of Valid Cases                             8


1. Based on normal approximation.


The Spearman rank-order correlation coefficient has a point estimate of R = 0.594. 
The exact two-sided p value is evaluated by Equation 9.4, as discussed in “Exact P 
Values” on p. 172 in Chapter 13. Its value is 0.125 and indicates that the correlation 
coefficient is not significantly different from 0. The corresponding asymptotic two- 
sided p value was 0.121. 
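
The mid-rank transformation and the resulting point estimate can be sketched in a few lines of Python. This is a minimal illustration, not the Exact Tests implementation. The function names are hypothetical, and the scores for subject 1 (author 82, social 46) are an assumption inferred from the mid-ranks listed for the eight-case subset in Figure 14.6.

```python
from math import sqrt

# Eight-case subset (Figure 14.6); subject 1's raw scores are assumed from its mid-ranks.
author = [82, 87, 87, 40, 111, 113, 111, 82]
social = [46, 46, 39, 56, 65, 88, 88, 46]


def midranks(values):
    """Replace each value by the average of the ranks occupied by its tied group."""
    ordered = sorted(values)
    n = len(ordered)
    result = []
    for v in values:
        first = ordered.index(v) + 1          # lowest 1-based rank in the tied group
        last = n - ordered[::-1].index(v)     # highest 1-based rank in the tied group
        result.append((first + last) / 2)
    return result


def pearson_r(x, y):
    """Pearson's product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


rauthor = midranks(author)
rsocial = midranks(social)
# Spearman's coefficient is Pearson's coefficient applied to the mid-ranks.
spearman = pearson_r(rauthor, rsocial)
print(rauthor, rsocial, round(spearman, 3))
```

Under these assumptions the rank columns agree with rauthor and rsocial, and the coefficient rounds to the value reported in Figure 14.8.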

As the number of paired observations grows, it becomes increasingly difficult to 
compute exact p values, and the Monte Carlo option is a better choice. Figure 14.9 
shows the Monte Carlo results for the larger data set of 12 pairs of observations in Figure 
14.3. The Monte Carlo sample size was 10,000. There is practically no difference 
between the Monte Carlo and exact p values. 
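
A Monte Carlo estimate of this kind can be sketched as follows. The Python code is a hedged illustration rather than the Exact Tests algorithm: it shuffles one set of ranks 10,000 times, estimates the two-sided p value as the proportion of shuffles with |R| at least as large as observed, and attaches a normal-approximation 99% confidence interval. The function names, the choice of Python's own random number generator (which will not reproduce SPSS output for the same seed), and the value 81 for subject 12's social score are assumptions.

```python
import random
from math import sqrt

# Complete data (Figure 14.3); the social score of subject 12 is assumed to be 81.
author = [82, 98, 87, 40, 116, 113, 111, 83, 85, 126, 106, 117]
social = [42, 46, 39, 37, 65, 88, 86, 56, 62, 92, 54, 81]


def ranks(values):
    """Ordinal ranks; adequate here because neither variable contains ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman_no_ties(rx, ry):
    """Spearman's coefficient for untied ranks: 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    n = len(rx)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


def monte_carlo_p(rx, ry, samples=10000, seed=2000000, z=2.576):
    """Estimated two-sided p value with a normal-approximation 99% interval."""
    rng = random.Random(seed)
    r_obs = abs(spearman_no_ties(rx, ry))
    shuffled = list(ry)
    hits = 0
    for _ in range(samples):
        rng.shuffle(shuffled)
        if abs(spearman_no_ties(rx, shuffled)) >= r_obs - 1e-12:
            hits += 1
    p_hat = hits / samples
    half_width = z * sqrt(p_hat * (1 - p_hat) / samples)
    return p_hat, max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


rs = spearman_no_ties(ranks(author), ranks(social))
p_hat, lower, upper = monte_carlo_p(ranks(author), ranks(social))
print(round(rs, 3), p_hat, lower, upper)
```

Increasing `samples` narrows the interval at the cost of longer run time, which mirrors the trade-off described for the Monte Carlo option.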


Figure 14.9 Monte Carlo results for Spearman’s correlation coefficient for complete social
status striving data


Symmetric Measures


                                                                            Monte Carlo Sig.
                                                                                    99% Confidence Interval
                                                     Asymp.       Approx.   Approx.
                                                     Std. Error   T         Sig.
Ordinal by Ordinal  Spearman Correlation                          4.500     .001¹
N of Valid Cases                             12


1. Based on normal approximation.
2. Based on 10000 sampled tables with starting seed 2000000.

Kendall’s W


All of the measures of association in this chapter are formed from a sequence of paired
observations. Sometimes, however, the data consist of K > 2 related samples rather than
just two related samples. Kendall’s W, also known as Kendall’s coefficient of
concordance, is a measure of association specially developed for this situation. It bears a close
relationship to Spearman’s rank-order correlation coefficient. For K > 2 related samples
of data, you could form $K!/(2!(K-2)!)$ distinct pairs of samples, and each pair would
yield a value for Spearman’s rank-order correlation coefficient. Let $\mathrm{ave}(R_S)$ denote the
average of all these Spearman correlation coefficients. Then you can show that, if there
are no ties in the data,


$\mathrm{ave}(R_S) = \frac{KW - 1}{K - 1}$        Equation 14.5


Kendall’s W is discussed in greater detail in Chapter 7, in the section “Kendall’s W” on 
p. 106, where a numerical example is also provided. 
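
The relationship between Kendall’s W and the average pairwise Spearman coefficient can be checked numerically. The sketch below is illustrative only: it draws K untied rankings of N objects at random, computes W from the rank totals using the standard untied formula W = 12S / (K²(N³ - N)), and compares the average Spearman coefficient with (KW - 1)/(K - 1). The function names, K = 5, N = 8, and the seed are arbitrary assumptions.

```python
import random
from itertools import combinations


def spearman_no_ties(rx, ry):
    """Spearman's coefficient for untied ranks."""
    n = len(rx)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


def kendall_w(rankings):
    """Kendall's coefficient of concordance for K untied rankings of N objects."""
    k, n = len(rankings), len(rankings[0])
    totals = [sum(ranking[i] for ranking in rankings) for i in range(n)]
    mean_total = k * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (k ** 2 * (n ** 3 - n))


rng = random.Random(1)
K, N = 5, 8
# Each related sample is a random permutation of the ranks 1..N (no ties).
rankings = [rng.sample(range(1, N + 1), N) for _ in range(K)]

w = kendall_w(rankings)
pairs = list(combinations(rankings, 2))
ave_rs = sum(spearman_no_ties(a, b) for a, b in pairs) / len(pairs)
print(ave_rs, (K * w - 1) / (K - 1))
```

The two printed quantities agree up to floating-point rounding for any untied rankings, not only for this random draw.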


Kendall’s Tau and Somers’ d Coefficients 


Kendall’s tau and Somers’ d coefficients are alternatives to Pearson’s product-moment 
correlation coefficient and Spearman’s rank-order correlation coefficient for ordinal 
data. The main distinction between these measures and Pearson’s or Spearman’s 
measures is that you can compute the former without specifying numerical values for 
the row scores, $u_i$, or the column scores, $v_j$. All that is needed is an implicit ordering of 
the data. On the other hand, Equation 14.1, Equation 14.3, and Equation 14.4 relate the 
row and column scores explicitly to the computation of Pearson’s and Spearman’s 
coefficients. 

Suppose that you have observed the r x c contingency table displayed as Table 9.1.
Kendall’s tau and Somers’ d are both based on the difference between concordant and
discordant pairs of observations in this contingency table. Since the rows and columns
of the contingency table are ordered, the location of any cell (h, k) relative to any other
cell (i, j) determines whether the observations in the two cells are concordant or
discordant. For example, if h < i and k < j, both members of a paired observation
falling in cell (h, k) are smaller than the corresponding members of the paired
observation falling in cell (i, j). Thus, the two pairs are concordant. On the other hand, if
h < i and k > j, the first member of the (h, k) pair is smaller, while the second member
is larger than corresponding members of the (i, j) pair. Thus, the two pairs are
discordant. The formula


$C_{ij} = \sum_{h<i} \sum_{k<j} x_{hk} + \sum_{h>i} \sum_{k>j} x_{hk}$        Equation 14.6


defines the number of pairs of observations that are concordant relative to the
observations in cell (i, j), and the formula


$D_{ij} = \sum_{h<i} \sum_{k>j} x_{hk} + \sum_{h>i} \sum_{k<j} x_{hk}$        Equation 14.7


defines the number of pairs of observations that are discordant relative to the
observations in cell (i, j). Thus, the total number of concordant pairs in the entire data set is


$P = \sum_{i=1}^{r} \sum_{j=1}^{c} x_{ij} C_{ij}$        Equation 14.8


and the total number of discordant pairs in the entire data set is


$Q = \sum_{i=1}^{r} \sum_{j=1}^{c} x_{ij} D_{ij}$        Equation 14.9


Kendall’s tau and Somers’ d and their various variants are functions of P - Q. Thus,
although their respective point estimates and standard errors differ, they all produce the
same p values. Next, these measures of association will be defined and their use
illustrated through a numerical example.


Kendall’s Tau-b and Kendall’s Tau-c 


Kendall’s tau coefficient has three variants, $\tau$, $\tau_b$, and $\tau_c$. You first specify estimators
and associated asymptotic standard errors for these three variants. For a discussion of
the criteria for selecting one variant over another, see Gibbons (1993). The $\tau_b$ and $\tau_c$
variants were developed to correct for ties and for categorical data.


Kendall’s $\tau_b$ coefficient is estimated by


$\tau_b = \frac{P - Q}{\sqrt{D_r D_c}}$        Equation 14.10


where


$D_r = N^2 - \sum_{i=1}^{r} m_i^2$        Equation 14.11


and


$D_c = N^2 - \sum_{j=1}^{c} n_j^2$        Equation 14.12


Kendall’s $\tau_c$ coefficient is estimated by


$\tau_c = \frac{q(P - Q)}{N^2 (q - 1)}$        Equation 14.13


where $q = \min(r, c)$.


Somers’ d


Somers’ d coefficient is a useful measure of association between two asymmetrically
related ordinal variables, where one of the two variables is regarded as independent and
the other as dependent. See Siegel and Castellan (1988) for a discussion of this
asymmetry. Somers’ d has three variants: one with the row variable U as the independent
variable, one with the column variable V as the independent variable, and a symmetric
version. The row-independent version of Somers’ d is


$D_{V|U} = \frac{P - Q}{D_r}$        Equation 14.14


The column-independent version of Somers’ d is


$D_{U|V} = \frac{P - Q}{D_c}$        Equation 14.15


The symmetric version of Somers’ d is


$D = \frac{P - Q}{(.5)(D_r + D_c)}$        Equation 14.16

Example: Smoking Habit Data


Observe that all variants of Kendall’s tau and Somers’ d are functions of P— Q. They 
differ only in how they are standardized. Thus, although their point estimates and 
asymptotic standard errors vary, the exact and asymptotic p values for testing the null 
hypothesis that there is no association are invariant across all these measures. Consider 
the crosstabulation shown in Figure 14.10 for the status of the smoking habit and the 
length of the smoking habit. This data set was extracted from Siegel and Castellan 
(1988). For convenience, only 96 subjects with a smoking habit between 10 and 25 years 
in duration have been considered. The variables in the table are status, which indicates 
the status of the smoking habit, with three categories (successful quitter, in-process 
quitter, and unsuccessful quitter), and years, which indicates the duration of the smoking 
habit. 


Figure 14.10 Crosstabulation of cessation and years of smoking for subset of data 


Count
                                               Years of Smoking Habit
                                               10 to 14   15 to 19   20 to 25
Status of Smoking Habit  Successful Quitter       22          9          8
                         In-process Quitter        2          1          3
                         Unsuccessful Quitter     14         21         16


Figure 14.11 shows the results for Kendall’s tau-b, Kendall’s tau-c, and all three
variants of Somers’ d for these data. The exact and asymptotic p values for testing the null
hypothesis that there is no correlation are also shown.


Figure 14.11 Kendall’s tau and Somers’ d for subset of smoking data 


Directional Measures


                                                                   Asymp.
                                                                   Std. Error¹   Approx. T²
Ordinal by Ordinal  Somers' d  Symmetric                           .091          2.372
                               Status of Smoking Habit Dependent                 2.372
                               Years of Smoking Habit Dependent                  2.372


1. Not assuming the null hypothesis.
2. Using the asymptotic standard error assuming the null hypothesis.

Figure 14.11 (Continued)


Symmetric Measures


                                               Asymp.
                                       Value   Std. Error¹   Approx. T²   Approx. Sig.   Exact Sig.
Ordinal by Ordinal  Kendall's tau-b    .215                  2.372        .018
                    Kendall's tau-c    .194    .082          2.372        .018           .023
N of Valid Cases                       96


1. Not assuming the null hypothesis.
2. Using the asymptotic standard error assuming the null hypothesis.


Although all of these coefficients have different point estimates, their sampling 
distributions are equivalent, thus leading to a common p value. The exact two-sided p 
value for testing the null hypothesis that there is no association is 0.0226, and the 
corresponding asymptotic two-sided p value is 0.0177. 
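
The point estimates can be traced directly from Equation 14.6 through Equation 14.16. The Python sketch below is a minimal illustration with hypothetical function names. It assumes that the 96-subject table consists of the 10-14, 15-19, and 20-25 columns of the full crosstabulation in Figure 14.12, whose counts sum to 96.

```python
from math import sqrt

# Rows: successful, in-process, unsuccessful quitter.
# Columns: 10-14, 15-19, 20-25 years (assumed to match the 96-subject subset).
table = [
    [22, 9, 8],
    [2, 1, 3],
    [14, 21, 16],
]


def concordance_totals(x):
    """Return (P, Q) as defined by Equations 14.6 through 14.9."""
    r, c = len(x), len(x[0])
    p_total = q_total = 0
    for i in range(r):
        for j in range(c):
            c_ij = sum(x[h][k] for h in range(r) for k in range(c)
                       if (h < i and k < j) or (h > i and k > j))
            d_ij = sum(x[h][k] for h in range(r) for k in range(c)
                       if (h < i and k > j) or (h > i and k < j))
            p_total += x[i][j] * c_ij
            q_total += x[i][j] * d_ij
    return p_total, q_total


def ordinal_measures(x):
    """Kendall's tau-b, tau-c and the three Somers' d variants."""
    r, c = len(x), len(x[0])
    n_obs = sum(sum(row) for row in x)
    m = [sum(row) for row in x]                               # row totals
    n = [sum(x[i][j] for i in range(r)) for j in range(c)]    # column totals
    p_total, q_total = concordance_totals(x)
    d_r = n_obs ** 2 - sum(v * v for v in m)
    d_c = n_obs ** 2 - sum(v * v for v in n)
    q = min(r, c)
    diff = p_total - q_total
    return {
        "tau_b": diff / sqrt(d_r * d_c),
        "tau_c": q * diff / (n_obs ** 2 * (q - 1)),
        "d_row_independent": diff / d_r,
        "d_col_independent": diff / d_c,
        "d_symmetric": diff / (0.5 * (d_r + d_c)),
    }


P, Q = concordance_totals(table)
measures = ordinal_measures(table)
print(P, Q, {name: round(value, 3) for name, value in measures.items()})
```

All five measures share the numerator P - Q and differ only in the denominator, which is why they lead to a common p value.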

As the number of observations grows, it becomes increasingly difficult to compute 
exact p values, and the Monte Carlo option is a better choice. Figure 14.12 shows the 
data for all 240 subjects who participated in the cessation of smoking study (Siegel and 
Castellan, 1988). 


Figure 14.12 Full data set for cessation and years of smoking 


Status of Smoking Habit * Years of Smoking Habit Crosstabulation 


Count
                                             Years of Smoking Habit
                                             1     2-4   5-9   10-14   15-19   20-25   > 25
Status of Smoking  Successful Quitter        13    29    26    22      9       8       8
Habit              In-Process Quitter        5     2     6     2       1       3       0
                   Unsuccessful Quitter      1     9     16    14      21      16      29
Total                                        19    40    48    38      31      27      37

Figure 14.13 shows the Monte Carlo results for the full data set. The Monte Carlo sample
size was 10,000.


Figure 14.13 Monte Carlo results for Kendall’s tau and Somers’ d for full smoking data 


Directional Measures


                                                                                         Monte Carlo Sig.
                                                                                                 99% Confidence Interval
                                                               Asymp.        Approx.  Approx.            Lower   Upper
                                                       Value   Std. Error¹   T²       Sig.      Sig.     Bound   Bound
Ordinal by  Somers' d  Symmetric                          .338    .046          7.339    .000      .000³    .000    .000
Ordinal                Status of Smoking Habit Dependent  .282    .038          7.339    .000      .000³    .000    .000
                       Years of Smoking Habit Dependent   .420    .058          7.339    .000      .000³    .000    .000


1. Not assuming the null hypothesis.
2. Using the asymptotic standard error assuming the null hypothesis.
3. Based on 10000 sampled tables with starting seed 2000000.


Symmetric Measures


                                                                            Monte Carlo Sig.
                                                                                    99% Confidence Interval
                                            Asymp.        Approx.  Approx.            Lower   Upper
                                    Value   Std. Error¹   T²       Sig.      Sig.     Bound   Bound
Ordinal by  Kendall's tau-b         .344    .047          7.339    .000      .000³    .000    .000
Ordinal     Kendall's tau-c         .359    .049          7.339    .000      .000³    .000    .000
N of Valid Cases                    240


1. Not assuming the null hypothesis.
2. Using the asymptotic standard error assuming the null hypothesis.
3. Based on 10000 sampled tables with starting seed 2000000.


It is clear that a strong correlation exists between the duration and status of the smoking 
habit. The exact two-sided p value for testing the null hypothesis that there is no 
correlation is at most 0.0003 with 95% confidence.

Gamma Coefficient


The gamma coefficient is yet another measure of association between two ordinal
variables. It was first discussed extensively by Goodman and Kruskal (1963). It is an
alternative to Kendall’s tau and Somers’ d for ordered categorical variables. Like these
measures, it is defined in terms of the difference between concordant and discordant
pairs, and so does not require the variables to take on actual numerical values. Using the
notation developed in the previous section, the gamma coefficient is estimated by


$\gamma = \frac{P - Q}{P + Q}$        Equation 14.17


If the data contain no ties, this definition of gamma will yield the same exact and 
asymptotic p values as Kendall’s tau and Somers’ d. In general, however, inference based 
on gamma can differ from inference based on the latter two coefficients. You can now 
analyze the small data set of cessation and smoking habit displayed in Figure 14.10. Figure 
14.14 displays point and interval estimates of gamma along with exact and asymptotic p 
values for testing the null hypothesis that there is no association. 


Figure 14.14 Gamma coefficient for subset of smoking data 


Symmetric Measures


                                      Asymp.                    Approx.   Exact
                              Value   Std. Error   Approx. T    Sig.      Significance
Ordinal by Ordinal  Gamma     .345    .140         2.372        .018      .024
N of Valid Cases              96


The gamma coefficient is estimated as 0.345. The exact two-sided p value for testing the 
null hypothesis that there is no association is 0.024. 
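
The pair-counting definition behind gamma can be illustrated by expanding the table into individual observations and classifying every pair. The Python sketch below is an illustration only. It assumes the same 96-subject table as above (the 10-14, 15-19, and 20-25 columns of Figure 14.12). Each pair is counted once here, whereas P and Q in Equation 14.8 and Equation 14.9 visit each pair from both of its cells, so the ratio in Equation 14.17 is unaffected.

```python
from itertools import combinations

# Rows: successful, in-process, unsuccessful quitter.
# Columns: 10-14, 15-19, 20-25 years (assumed 96-subject subset).
table = [
    [22, 9, 8],
    [2, 1, 3],
    [14, 21, 16],
]

# One (row index, column index) tuple per subject.
observations = [(i, j)
                for i, row in enumerate(table)
                for j, count in enumerate(row)
                for _ in range(count)]

concordant = discordant = 0
for (i1, j1), (i2, j2) in combinations(observations, 2):
    direction = (i1 - i2) * (j1 - j2)
    if direction > 0:
        concordant += 1      # both variables order the pair the same way
    elif direction < 0:
        discordant += 1      # the variables order the pair in opposite ways
    # direction == 0: the pair is tied on at least one variable and is ignored

gamma = (concordant - discordant) / (concordant + discordant)
print(concordant, discordant, round(gamma, 3))
```

Because tied pairs are dropped from both numerator and denominator, gamma is never smaller in absolute value than the tau and Somers’ d coefficients computed from the same table.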

As the number of observations grows, it becomes increasingly difficult to compute 
exact p values, and the Monte Carlo option is a better choice. Figure 14.15 shows the 
Monte Carlo results for the full cessation and smoking habit data set shown in Figure 
14.12. The Monte Carlo sample size was 10,000.

Figure 14.15 Monte Carlo results for gamma coefficient for full smoking data


Symmetric Measures


                                                                        Monte Carlo Sig.
                                                                                99% Confidence Interval
                                      Asymp.        Approx.  Approx.            Lower   Upper
                              Value   Std. Error¹   T²       Sig.      Sig.     Bound   Bound
Ordinal by Ordinal  Gamma     .483    .064          7.339    .000      .000³    .000    .000
N of Valid Cases              240


1. Not assuming the null hypothesis.
2. Using the asymptotic standard error assuming the null hypothesis.
3. Based on 10000 sampled tables with starting seed 2000000.


It is clear that a strong correlation exists between the duration and status of the smoking 
habit. The exact two-sided p value for testing the null hypothesis that there is no 
correlation is at most 0.0005 with 99% confidence. 
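
Bounds of this kind arise when none of the sampled tables is as extreme as the observed one. The sketch below shows one standard way to obtain such a bound and is offered as an assumption about how the quoted figures can be reproduced, not as the documented Exact Tests calculation. If the true p value is p, the chance of seeing no extreme table in M independent samples is (1 - p)^M, so the upper bound at confidence level c solves (1 - p)^M = 1 - c. The function name is hypothetical.

```python
def upper_bound_zero_hits(samples, confidence):
    """Upper confidence bound for p when 0 of `samples` Monte Carlo tables are extreme."""
    return 1 - (1 - confidence) ** (1 / samples)


bound_99 = upper_bound_zero_hits(10000, 0.99)
bound_95 = upper_bound_zero_hits(10000, 0.95)
print(round(bound_99, 4), round(bound_95, 4))
```

With 10,000 samples the 99% bound rounds to 0.0005 and the 95% bound to 0.0003, matching the two statements made for the full smoking data.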


Measures of Association for Nominal Data


Measures of association for nominal data are defined on r x c contingency tables like
Table 13.1. However, these measures do not depend on the particular order in which the
rows and columns are arranged, nor do they depend on row and column scores.
Interchanging two rows or two columns does not alter these measures of association. Exact
Tests provides the following measures of association between pairs of nominal
categorical variables:


Contingency Coefficients. These coefficients are derived from the Pearson chi-square 
statistic. They include the Pearson coefficient, Cramér’s V coefficient, and the phi 
coefficient. 


Proportional Reduction in Prediction Error. Goodman and Kruskal’s tau and the 
uncertainty coefficient are measures for assessing the power of one variable to predict 
the classification of members of the population with respect to a second variable. 

These measures of association range between 0 and 1, with 0 signifying no
association and 1 signifying perfect association.


Available Measures 


Table 15.1 shows the available tests, the procedure from which they can be obtained, 
and a bibliographical reference for each test. 


Table 15.1 Available tests 


Measure of Association Procedure Reference 
Contingency coefficients Crosstabs Liebetrau (1983) 
Goodman and Kruskal’s tau Crosstabs Bishop et al. (1975) 
Uncertainty coefficient Crosstabs IMSL (1994) 


Contingency Coefficients 


All of the measures of association in this family are functions of the Pearson chi-square
statistic CH(x), specified by Equation 10.3. They include the phi contingency
coefficient, the Pearson contingency coefficient, and Cramér’s V contingency coefficient. All
of these measures have an identical two-sided p value for testing the null hypothesis that
there is no association, which is the same as the Pearson chi-square p value and which
is based on the distribution of CH(x). Exact Tests reports both the asymptotic and exact
p values.

The formulas for computing the three contingency coefficients are given below. The
formula for each measure involves taking the square root of a function of CH(x). The
positive root is always selected. For a more detailed discussion of these measures of
association, see Liebetrau (1983).

The phi contingency coefficient is given by the formula


$\phi = \sqrt{\frac{CH(x)}{N}}$        Equation 15.1


The minimum value assumed by $\phi$ is 0, signifying no association. However, its upper
bound is not fixed but depends on the dimensions of the contingency table. Therefore,
it is not a very suitable measure for arbitrary r x c tables. For the special case of the
2 x 2 table, Gibbons (1985) shows that $\phi$ is identical to the absolute value of Kendall’s
$\tau_b$ coefficient and is evaluated by the formula


$\phi = \frac{x_{11} x_{22} - x_{12} x_{21}}{\sqrt{m_1 m_2 n_1 n_2}}$        Equation 15.2


Notice from Equation 15.2 that, for the 2 x 2 contingency table, $\phi$ could be either
positive or negative, which implies a positive or negative association in the 2 x 2 table.
The Pearson contingency coefficient is given by the formula


$CC = \sqrt{\frac{CH(x)}{CH(x) + N}}$        Equation 15.3


This contingency coefficient assumes a minimum value of 0, signifying no association.
It is bounded from above by 1, signifying perfect association. However, the maximum
value attainable by CC is $\sqrt{(q-1)/q}$, where $q = \min(r, c)$. Thus, the range of this
contingency coefficient still depends on the dimensions of the r x c table. Cramér’s V
coefficient ranges between 0 and 1, with 0 signifying no association and 1 signifying
perfect association. It is given by


$V = \sqrt{\frac{CH(x)}{N(q-1)}}$        Equation 15.4


Exact Tests reports the point estimate of the contingency coefficient. The formulas for
the asymptotic standard errors of these coefficients are fairly complicated. They are
described in the algorithms manual available on the Manuals CD and also available by
selecting Algorithms on the Help menu.

These measures may be used to analyze an unordered contingency table given in
Siegel and Castellan (1988). The data consist of a crosstabulation of three possible responses
(completed, declined, no response) to a questionnaire concerning the financial
accounting standards used by six different organizations responsible for maintaining such
standards. These organizations are identified only by their initials (AAA, AICPA, FAF, FASB,
FEI, and NAA). The crosstabulated data are shown in Figure 15.1.


Figure 15.1 Crosstabulation of response to survey and finance organization 


Survey Disposition * Finance Organization Crosstabulation 


Count 
Finance Organization 
AAA AICPA FAF FASB FEI NAA 
Survey Disposition | Completed 8 8 3 11 17 
Declined 2 5 1 2 13 
No Response 12 8 15 19 18 


First, these data are analyzed using only the first three columns of Figure 15.1. For this 
subset of the data, Figure 15.2 shows the results for the contingency coefficients. The 
exact two-sided p value for testing the null hypothesis that there is no association is also 
reported. Its value is 0.090, slightly lower than the asymptotic p value of 0.092. 
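
The three contingency coefficients for this subset can be reproduced from the Pearson chi-square statistic. The Python sketch below is illustrative, with hypothetical function names. It takes the first three counts in each row of Figure 15.1 as the AAA, AICPA, and FAF columns, an assumption supported by their total of 62 cases.

```python
from math import sqrt

# Rows: completed, declined, no response. Columns: AAA, AICPA, FAF (assumed).
table = [
    [8, 8, 3],
    [2, 5, 1],
    [12, 8, 15],
]


def pearson_chi_square(x):
    """Pearson chi-square statistic for an r x c table of counts."""
    r, c = len(x), len(x[0])
    n_obs = sum(sum(row) for row in x)
    m = [sum(row) for row in x]                               # row totals
    n = [sum(x[i][j] for i in range(r)) for j in range(c)]    # column totals
    chi2 = 0.0
    for i in range(r):
        for j in range(c):
            expected = m[i] * n[j] / n_obs
            chi2 += (x[i][j] - expected) ** 2 / expected
    return chi2


def contingency_coefficients(x):
    """Phi, Pearson's contingency coefficient, and Cramer's V."""
    n_obs = sum(sum(row) for row in x)
    q = min(len(x), len(x[0]))
    chi2 = pearson_chi_square(x)
    phi = sqrt(chi2 / n_obs)
    cc = sqrt(chi2 / (chi2 + n_obs))
    v = sqrt(chi2 / (n_obs * (q - 1)))
    return phi, cc, v


phi, cc, v = contingency_coefficients(table)
print(round(phi, 3), round(cc, 3), round(v, 3))
```

Since all three coefficients are monotone functions of the same chi-square statistic, a permutation or Monte Carlo test based on any of them yields the same p value.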


Figure 15.2 Phi and Cramér’s V for first three columns for survey and finance organization
data


Symmetric Measures


                                            Approx.   Exact
                                    Value   Sig.      Significance
Nominal by Nominal  Phi             .359    .092      .090
                    Cramér's V      .254    .092      .090
N of Valid Cases                    62


The next analysis uses the full data set, which consists of all six columns of Figure 15.1.
This data set is too large to compute the exact p value. However, a 99% confidence
interval on the exact p value based on 10,000 Monte Carlo samples is easily obtained. The
results are shown in Figure 15.3.

Figure 15.3 Monte Carlo results for phi and Cramér’s V


Symmetric Measures


                                                        Monte Carlo Significance
                                                                  99% Confidence Interval
                                            Approx.               Lower    Upper
                                    Value   Sig.       Sig.       Bound    Bound
Nominal by Nominal  Phi                     .000       .0000¹     .0000    .0005
                    Cramér's V      .511    .000       .0000¹     .0000    .0005
N of Valid Cases


1. Based on 10000 sampled tables with starting seed 2000000.


The p value for testing the null hypothesis that there is no association is at most 0.0005
with 99% confidence, which implies that the row and column classifications are not
independent.


Proportional Reduction in Prediction Error 


In regression problems involving continuous data, the coefficient of determination (or R²
statistic) is often used to measure the proportion of the total variation attributable to the
explanatory variable. It would be useful to provide an analog of this index for nominal
categorical data. Two measures of association are available for this purpose. One is Goodman
and Kruskal’s tau, and the other is the uncertainty coefficient. Both measure the proportion
of variation in the row variable that can be attributed to the column variable.


Goodman and Kruskal’s Tau 


Goodman and Kruskal’s tau coefficient for measuring the proportion of the variation in
the row variable attributable to the column variable is estimated by


$\tau_{R|C} = \frac{\sum_{j=1}^{c} n_j^{-1} \sum_{i=1}^{r} x_{ij}^2 - N^{-1} \sum_{i=1}^{r} m_i^2}{N - N^{-1} \sum_{i=1}^{r} m_i^2}$        Equation 15.5


This coefficient ranges between 0 and 1, with 0 implying no reduction in row variance
when the column category is known, and 1 implying complete reduction in row variance
when the column category is known. An asymptotic confidence interval for Goodman
and Kruskal’s tau can be obtained by computing the asymptotic standard error
ASE1 and applying it to Equation 13.1. The exact two-sided p value for testing the null
hypothesis that there is no association is obtained by substituting $\tau_{R|C}(x)$ for M(x) in
Equation 13.1. The corresponding asymptotic two-sided p value is obtained by using the
fact that $\tau_{R|C}(x)$ converges to a chi-square distribution with (r-1)(c-1) degrees of
freedom.


Uncertainty Coefficient 


The uncertainty coefficient is derived from the likelihood-ratio statistic and is an
alternative way to measure the proportion of the variation in the row variable attributable to
the column variable. It is estimated by


$U_{R|C} = \frac{\sum_{i=1}^{r} \sum_{j=1}^{c} x_{ij} \log(m_i n_j / (N x_{ij}))}{\sum_{i=1}^{r} m_i \log(m_i / N)}$        Equation 15.6


This uncertainty coefficient ranges between 0 and 1, with 0 implying no reduction in
row variance when the column category is known, and 1 implying complete reduction
in row variance when the column category is known.

An asymptotic confidence interval for the uncertainty coefficient can be obtained by
computing the asymptotic standard error ASE1 and applying it to Equation 13.1. The
exact two-sided p value for testing the null hypothesis that there is no association is
obtained by substituting $U_{R|C}(x)$ for M(x) in Equation 13.1. The corresponding
asymptotic two-sided p value is obtained by using the fact that $U_{R|C}(x)$ converges to a
chi-square distribution with (r-1)(c-1) degrees of freedom.


Example: Party Preference Data 


The data set shown in Figure 15.4 illustrates the use of Goodman and Kruskal’s tau and
the uncertainty coefficient. The data set compares party preference with preferred cold war
ally in Great Britain. These data are taken from Bishop, Fienberg, and Holland (1975).

Figure 15.4 Crosstabulation of party preference with preferred cold war ally


Count
                              Preferred Cold War Ally
                              U.S.    U.S.S.R.
Party Preference  Right       225     3
                  Center      53      1
                  Left        206     12


First, Goodman and Kruskal’s tau is estimated, a confidence interval is obtained for it, 
and the null hypothesis that there is no association in the population is tested. The results 
are shown in Figure 15.5. 


Figure 15.5 Goodman and Kruskal’s tau for party preference and preferred cold war ally 
data 


Directional Measures


                                                                            Asymp.        Approx.   Exact
                                                                    Value   Std. Error¹   Sig.      Significance
Nominal by  Goodman and  Party Preference Dependent                 .010    .006          .008⁴     .015
Nominal     Kruskal tau  Preferred Cold War Ally Dependent          .013    .010          .036⁴     .045


1. Not assuming the null hypothesis
4. Based on the chi-square approximation

The observed value of Goodman and Kruskal’s tau with ally, 0.013, is rather small and 
leads to the conclusion that 1.3% of the variation in choice of preferred ally is explained 
by knowing a person’s party preference. The exact p value, 0.045, implies that the null 
hypothesis that there is no association can be rejected at the 5% level. In other words, 
the small amount of explained variation is real, not due to sampling error. 
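
Equation 15.5 can be applied to the counts in Figure 15.4 with a short Python sketch. The code is illustrative and the function names are hypothetical. The function treats the row variable as dependent, so the value with ally as the dependent variable is obtained by transposing the table first.

```python
# Rows: Right, Center, Left. Columns: U.S., U.S.S.R. (Figure 15.4).
table = [
    [225, 3],
    [53, 1],
    [206, 12],
]


def goodman_kruskal_tau(x):
    """Goodman and Kruskal's tau with the row variable as the dependent variable."""
    r, c = len(x), len(x[0])
    n_obs = sum(sum(row) for row in x)
    m = [sum(row) for row in x]                               # row totals
    n = [sum(x[i][j] for i in range(r)) for j in range(c)]    # column totals
    within = sum(sum(x[i][j] ** 2 for i in range(r)) / n[j] for j in range(c))
    marginal = sum(v * v for v in m) / n_obs
    return (within - marginal) / (n_obs - marginal)


def transpose(x):
    """Swap the roles of the row and column variables."""
    return [list(column) for column in zip(*x)]


tau_party_dependent = goodman_kruskal_tau(table)
tau_ally_dependent = goodman_kruskal_tau(transpose(table))
print(round(tau_party_dependent, 3), round(tau_ally_dependent, 3))
```

The two printed values correspond to the Party Preference Dependent and Preferred Cold War Ally Dependent rows of Figure 15.5.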

Next, the uncertainty coefficient is estimated, a confidence interval is obtained for it,
and the null hypothesis that there is no association in the population is tested. The results
are shown in Figure 15.6.

Figure 15.6 Uncertainty coefficient for party preference and preferred cold war ally data


Directional Measures


                                                                         Asymp.        Approx.   Approx.   Exact
                                                                 Value   Std. Error¹   T²        Sig.      Significance
Nominal by  Uncertainty  Symmetric                               .012    .009          1.346     .033³     .034
Nominal     Coefficient  Party Preference Dependent              .007    .005          1.346     .033³     .034
                         Preferred Cold War Ally Dependent       .048    .034          1.346     .033³     .034


1. Not assuming the null hypothesis
2. Using the asymptotic standard error assuming the null hypothesis
3. Likelihood ratio chi-square probability


Once again, the observed value of the uncertainty coefficient with party preference as
the dependent variable, 0.007, is extremely small. However, the exact two-sided p value,
0.034, is statistically significant and indicates that the measure is indeed greater than 0.
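
The uncertainty coefficient in Equation 15.6 can be computed for the same table in a few lines. This Python sketch is illustrative, with hypothetical function names. Cells with a zero count are skipped, on the usual convention that they contribute nothing to the sum, and the symmetric version is taken as the standard combination of the two directional numerators and denominators, which is an assumption not spelled out in the text.

```python
from math import log

# Rows: Right, Center, Left. Columns: U.S., U.S.S.R. (Figure 15.4).
table = [
    [225, 3],
    [53, 1],
    [206, 12],
]


def entropy_terms(x):
    """Return (numerator, row denominator, column denominator) of Equation 15.6."""
    r, c = len(x), len(x[0])
    n_obs = sum(sum(row) for row in x)
    m = [sum(row) for row in x]                               # row totals
    n = [sum(x[i][j] for i in range(r)) for j in range(c)]    # column totals
    numerator = sum(x[i][j] * log(m[i] * n[j] / (n_obs * x[i][j]))
                    for i in range(r) for j in range(c) if x[i][j] > 0)
    row_denominator = sum(v * log(v / n_obs) for v in m if v > 0)
    col_denominator = sum(v * log(v / n_obs) for v in n if v > 0)
    return numerator, row_denominator, col_denominator


numerator, row_den, col_den = entropy_terms(table)
u_party_dependent = numerator / row_den       # row variable dependent
u_ally_dependent = numerator / col_den        # column variable dependent
u_symmetric = 2 * numerator / (row_den + col_den)
print(round(u_party_dependent, 3), round(u_ally_dependent, 3), round(u_symmetric, 3))
```

The numerator and both denominators are negative, so each ratio is positive. The three values can be compared with the Value column of Figure 15.6.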


Measures of Agreement


This chapter discusses kappa, a measure used to assess the level of agreement between
two observers classifying a sample of objects on the same categorical scale. The joint
ratings of the observers are displayed on a square r x r contingency table such as Table
13.1. Kappa (see Agresti, 1990) can be obtained using the Crosstabs procedure.


Kappa


The kappa coefficient is defined on a square r x r contingency table. It is estimated by


$\hat{\kappa} = \frac{N \sum_{i=1}^{r} x_{ii} - \sum_{i=1}^{r} m_i n_i}{N^2 - \sum_{i=1}^{r} m_i n_i}$        Equation 16.1


Notice that the kappa statistic does not depend on the off-diagonal elements of the
observed contingency table. If the row classification is by one observer, and the column
classification is by a second observer, this measure of agreement is determined entirely
by the diagonal elements.


Example: Student Teacher Ratings 


Consider the following data on student teachers who were rated by their supervisors,
represented by variables super1 and super2. The students were rated as authoritarian,
democratic, or permissive. The full data set of 72 student teachers is available in
Bishop, Fienberg, and Holland (1975). In the following example, a subset of 10 students is
considered. The crosstabulated data are shown in Figure 16.1.

Figure 16.1 Crosstabulation of student teachers rated by supervisors (partial data)


Rating by Supervisor 1 * Rating by Supervisor 2 Crosstabulation


Count (rows: Rating by Supervisor 1; columns: Rating by Supervisor 2; categories:
Authoritarian, Democratic, Permissive)

The results for the kappa statistic are shown in Figure 16.2. 


Figure 16.2 Kappa for student teacher ratings data 


Symmetric Measures


                                        Asymp.                    Approx.   Exact
                                Value   Std. Error   Approx. T    Sig.      Significance
Measure of Agreement  Kappa     .531    .237         2.348        .019      .048
N of Valid Cases                10


The value of kappa is estimated at $\hat{\kappa} = 0.531$. The positive sign on the kappa statistic
implies that the agreement is positive. The exact two-sided p value of 0.048 is
significant; thus, you can reject the null hypothesis that there is no agreement. Notice,
however, that the asymptotic two-sided p value is not very accurate for this small data
set. It is less than one half of the exact p value.

The same analysis conducted with the full data set of 72 observations is tabulated in
Figure 16.3.


Figure 16.3 Crosstabulation of student teachers rated by supervisors (full data) 


Rating by Supervisor 1 * Rating by Supervisor 2 Crosstabulation


Count
                                   Rating by Supervisor 2
                                   Authoritarian   Democratic   Permissive
Rating by      Authoritarian            17              4            8
Supervisor 1   Democratic                5             12            0
               Permissive               10              3           13

For this larger data set, it is more efficient to perform the Monte Carlo inference rather
than the exact inference. Figure 16.4 shows the results based on 10,000 Monte Carlo
samples.


Figure 16.4 Monte Carlo results for student teacher ratings data 


Symmetric Measures


                                                                        Monte Carlo Significance
                                                                                  99% Confidence Interval
                                        Asymp.                    Approx.              Lower    Upper
                                Value   Std. Error   Approx. T    Sig.       Sig.      Bound    Bound
Measure of Agreement  Kappa     .362    .091         4.329        .000       .0000¹    .0000    .0005
N of Valid Cases                72


1. Based on 10000 sampled tables with starting seed 2000000.


In the full data set, the kappa statistic has a smaller value, 0.362. However, due to the
larger sample size, this observed statistic is highly significant, with a two-sided p value
guaranteed to be less than 0.0005 with 99% confidence.
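
The two numbers quoted above can be checked with a short calculation. The following Python sketch is illustrative only and is not part of Exact Tests. It assumes the standard definition of Cohen's kappa, (observed agreement - chance agreement) / (1 - chance agreement), and it takes the cell counts from Figure 16.3 with the single empty cell read as 0. It also assumes that, when none of the M sampled tables is as extreme as the observed one, the upper confidence bound is 1 - (1 - level)^(1/M). The function names are hypothetical.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square table of counts (rows: rater 1, columns: rater 2)."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    observed = sum(table[i][i] for i in range(k)) / n          # proportion on the diagonal
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n ** 2
    return (observed - expected) / (1 - expected)


def upper_bound_when_no_hits(samples, level=0.99):
    """Assumed upper confidence bound for the p value when the Monte Carlo estimate is 0."""
    return 1 - (1 - level) ** (1 / samples)


# Counts as read from Figure 16.3; the zero cell is an assumption (it is blank in the figure).
full_table = [
    [17, 4, 8],
    [5, 12, 0],
    [10, 3, 13],
]

kappa = cohen_kappa(full_table)
upper = upper_bound_when_no_hits(10_000, 0.99)
print(f"kappa = {kappa:.3f}")
print(f"99% upper bound = {upper:.4f}")
```

Under these assumptions the sketch reproduces the kappa value of 0.362 and the upper bound of 0.0005 shown in Figure 16.4.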


Syntax Reference 


CROSSTABS 


Exact Tests Syntax 


The /METHOD subcommand allows you to specify the method used to calculate significance 
levels. See the Syntax Reference Guide for a description of the full CROSSTABS syntax. 


METHOD Subcommand 


Displays additional results for each statistic requested. If no METHOD subcommand is
specified, the standard asymptotic results are displayed. If fractional weights have been
specified, results for all methods will be calculated on the weight rounded to the nearest
integer.


MC
    Displays an unbiased point estimate and confidence interval based on the
    Monte Carlo sampling method, for all statistics. Asymptotic results are also
    displayed. When exact results can be calculated, they will be provided instead
    of the Monte Carlo results. See Appendix A for details of the situations under
    which exact results are provided instead of Monte Carlo results. Two optional
    keywords, CIN and SAMPLES, are provided if you choose /METHOD=MC.


CIN(n)
    Controls the confidence level for the Monte Carlo estimate. CIN is available
    only when /METHOD=MC is specified. CIN has a default value of 99.0. You
    can specify a confidence level between 0.01 and 99.9, inclusive.


SAMPLES
    Specifies the number of tables sampled from the reference set when calculating
    the Monte Carlo estimate of the exact p value. Larger sample sizes lead to
    narrower confidence limits, but also take longer to calculate. You can
    specify any integer between 1 and 1,000,000,000 as the sample size. SAMPLES
    has a default value of 10,000.


EXACT
    Computes the exact significance level for all statistics, in addition to the
    asymptotic results. If both the EXACT and MC keywords are specified, only exact
    results are provided. Calculating the exact p value can be memory-intensive.
    If you have specified /METHOD=EXACT and find that you have insufficient
    memory to calculate results, you should first close any other applications
    that are currently running in order to make more memory available. You can
    also enlarge the size of your swap file (see your Windows manual for more
    information). If you still cannot obtain exact results, specify /METHOD=MC to
    obtain the Monte Carlo estimate of the exact p value. An optional TIMER
    keyword is available if you choose /METHOD=EXACT.


TIMER(n)
    Specifies the maximum number of minutes allowed to run the exact analysis
    for each statistic. If the time limit is reached, the test is terminated, no
    exact results are provided, and the application begins to calculate the next
    test in the analysis. TIMER is available only when /METHOD=EXACT is
    specified. You can specify any integer value for TIMER. Specifying a value
    of 0 for TIMER turns the timer off completely. TIMER has a default value of
    5 minutes. If a test exceeds a time limit of 30 minutes, it is recommended
    that you use the Monte Carlo, rather than the exact, method.
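
To see how SAMPLES and CIN interact, the following Python sketch computes a confidence interval for a Monte Carlo p value estimate at several sample sizes. It is a rough illustration under stated assumptions, not the formula Exact Tests is documented to use here: it applies the usual normal approximation for a binomial proportion, the estimate of 0.05 is a made-up value, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mc_confidence_interval(p_hat, samples, cin=99.0):
    """Normal-approximation interval for a Monte Carlo p value estimate.

    p_hat   -- estimated p value (proportion of sampled tables at least as extreme)
    samples -- number of sampled tables, as set by SAMPLES
    cin     -- confidence level in percent, as set by CIN
    """
    z = NormalDist().inv_cdf(0.5 + cin / 200.0)   # two-sided critical value
    half_width = z * sqrt(p_hat * (1.0 - p_hat) / samples)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


# Hypothetical estimate of 0.05 at three sample sizes.
for samples in (1_000, 10_000, 100_000):
    lower, upper = mc_confidence_interval(0.05, samples)
    print(f"SAMPLES({samples}): [{lower:.4f}, {upper:.4f}]")
```

Because the half-width shrinks with the square root of the number of samples, a tenfold increase in SAMPLES narrows the interval by a factor of about 3.2 while taking roughly ten times as long to compute.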
NPAR TESTS 


Exact Tests Syntax 


The METHOD subcommand allows you to specify the method used to calculate significance
levels. The MH subcommand performs the marginal homogeneity test. The J-T subcommand
performs the Jonckheere-Terpstra test. See the Syntax Reference Guide for a complete
description of the NPAR TESTS syntax.


METHOD Subcommand 


Displays additional results for each statistic requested. If no METHOD subcommand is
specified, the standard asymptotic results are displayed.


MC
    Displays an unbiased point estimate and confidence interval based on the
    Monte Carlo sampling method, for all statistics. Asymptotic results are also
    displayed. When exact results can be calculated, they will be provided instead
    of the Monte Carlo results. See Appendix A for details of the situations under
    which exact results are provided instead of Monte Carlo results. Two optional
    keywords, CIN and SAMPLES, are provided if you choose /METHOD=MC.


CIN(n)
    Controls the confidence level for the Monte Carlo estimate. CIN is available
    only when /METHOD=MC is specified. You can specify a confidence level
    between 0.01 and 99.9, inclusive.


SAMPLES
    Specifies the number of tables sampled from the reference set when calculating
    the Monte Carlo estimate of the exact p value. Larger sample sizes lead to
    narrower confidence limits, but also take longer to calculate. You can
    specify any integer between 1 and 1,000,000,000 as the sample size. SAMPLES
    has a default value of 10,000.


EXACT
    Computes the exact significance level for all statistics, in addition to the
    asymptotic results. If both the EXACT and MC keywords are specified, only exact
    results are provided. Calculating the exact p value can be memory-intensive.
    If you have specified /METHOD=EXACT and find that you have insufficient
    memory to calculate results, you should first close any other applications
    that are currently running in order to make more memory available. You can
    also enlarge the size of your swap file (see your Windows manual for more
    information). If you still cannot obtain exact results, specify /METHOD=MC to
    obtain the Monte Carlo estimate of the exact p value. An optional TIMER
    keyword is available if you choose /METHOD=EXACT.


TIMER(n)
    Specifies the maximum number of minutes allowed to run the exact analysis
    for each statistic. If the time limit is reached, the test is terminated, no
    exact results are provided, and the application begins to calculate the next
    test in the analysis. TIMER is available only when /METHOD=EXACT is
    specified. You can specify any integer value for TIMER. Specifying a value
    of 0 for TIMER turns the timer off completely. TIMER has a default value of
    5 minutes. If a test exceeds a time limit of 30 minutes, it is recommended that
    you use the Monte Carlo, rather than the exact, method.


MH Subcommand 


NPAR TESTS /MH=varlist [WITH varlist [(PAIRED)]]


MH performs the marginal homogeneity test, which tests whether combinations of values
between two paired ordinal variables are equally likely. The marginal homogeneity test is
typically used in repeated measures situations. This test is an extension of the McNemar
test from binary response to multinomial response. The output shows the number of distinct
values for all test variables, the number of valid off-diagonal cell counts, mean, standard
deviation, observed and standardized values of the test statistics, the asymptotic
two-tailed probability for each pair of variables, and, if a /METHOD subcommand is
specified, one-tailed and two-tailed exact or Monte Carlo probabilities.


Syntax


• The minimum specification is a list of two variables. Variables must be polychotomous
  and must have more than two values. If the variables contain only two values, the
  McNemar test is performed.

• If keyword WITH is not specified, each variable is paired with every other variable in
  the list.

• If WITH is specified, each variable before WITH is paired with each variable after WITH.
  If PAIRED is also specified, the first variable before WITH is paired with the first
  variable after WITH, the second variable before WITH with the second variable after WITH,
  and so on. PAIRED cannot be specified without WITH.

• With PAIRED, the number of variables specified before and after WITH must be the same.
  PAIRED must be specified in parentheses after the second variable list.


Operations


• The data consist of paired, dependent responses from two populations. The marginal
  homogeneity test tests the equality of two multinomial c x 1 tables, and the data can be
  arranged in the form of a square c x c contingency table. A 2 x c table is constructed
  for each off-diagonal cell count. The marginal homogeneity test statistic is computed
  for cases with different values for the two variables. Only combinations for which the
  values for the two variables are different are considered. The first row of each 2 x c
  table specifies the category chosen by population 1, and the second row specifies the
  category chosen by population 2. The test statistic is calculated by summing the first
  row scores across all 2 x c tables.


Example


NPAR TESTS /MH=V1 V2 V3
 /METHOD=MC.


• This example performs the marginal homogeneity test on variable pairs V1 and V2, V1 and
  V3, and V2 and V3. The exact p values are estimated using the Monte Carlo sampling method.
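
The pairing rules for WITH and PAIRED can be summarized in a few lines of Python. This is only a sketch of the rules as stated in this section, not SPSS code; the function name mh_variable_pairs and the variable names are hypothetical.

```python
from itertools import combinations, product


def mh_variable_pairs(before, after=None, paired=False):
    """Return the variable pairs that an /MH specification would test.

    before -- variables listed before WITH (or the whole list if WITH is absent)
    after  -- variables listed after WITH, or None if WITH is not specified
    paired -- True if (PAIRED) is specified
    """
    if after is None:
        if paired:
            raise ValueError("PAIRED cannot be specified without WITH")
        # No WITH: each variable is paired with every other variable in the list.
        return list(combinations(before, 2))
    if paired:
        if len(before) != len(after):
            raise ValueError("With PAIRED, both variable lists must have the same length")
        # WITH ... (PAIRED): first with first, second with second, and so on.
        return list(zip(before, after))
    # WITH only: each variable before WITH is paired with each variable after WITH.
    return list(product(before, after))


print(mh_variable_pairs(["V1", "V2", "V3"]))
print(mh_variable_pairs(["V1", "V2"], ["V3", "V4"]))
print(mh_variable_pairs(["V1", "V2"], ["V3", "V4"], paired=True))
```

The first call corresponds to /MH=V1 V2 V3 and yields the three pairs tested in the example above; the second and third calls show how adding (PAIRED) reduces four pairs to two.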


J-T Subcommand 


NPAR TESTS /J-T=varlist BY variable(value1,value2)


J-T (alias JONCKHEERE-TERPSTRA) performs the Jonckheere-Terpstra test, which tests
whether k independent samples defined by a grouping variable are from the same population.
This test is particularly powerful when the k populations have a natural ordering. The
output shows the number of levels in the grouping variable, the total number of cases, the
observed and standardized values of the test statistic, its mean and standard deviation,
the two-tailed asymptotic significance, and, if a /METHOD subcommand is specified,
one-tailed and two-tailed exact or Monte Carlo probabilities.


Syntax


• The minimum specification is a test variable, the keyword BY, a grouping variable, and a
  pair of values in parentheses.

• Every value in the range defined by the pair of values for the grouping variable forms a
  group.

• If the /METHOD subcommand is specified, and the number of populations, k, is greater
  than 5, the p value is estimated using the Monte Carlo sampling method. The exact p value
  is not available when k exceeds 5.


Operations


• Cases from the k groups are ranked in a single series, and the rank sum for each group is
  computed. A test statistic is calculated for each variable specified before BY.

• The Jonckheere-Terpstra statistic has approximately a normal distribution.

• Cases with values other than those in the range specified for the grouping variable are
  excluded.

• The direction of a one-tailed inference is indicated by the sign of the standardized test
  statistic.


Example


NPAR TESTS /J-T=V1 BY V2 (0,4)
 /METHOD=EXACT.


• This example performs the Jonckheere-Terpstra test for groups defined by values 0
  through 4 of V2. The exact p values are calculated.
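
The following Python sketch shows one common way to compute the Jonckheere-Terpstra statistic and its standardized value. It is an illustration under stated assumptions rather than the SPSS implementation: the statistic is formed as the sum of Mann-Whitney counts over all ordered pairs of groups, the mean and variance are the textbook formulas for data without ties, group codes are assumed to be integers, and all names and data are hypothetical.

```python
from math import sqrt


def split_by_group(values, codes, low, high):
    """Form one group per integer code in [low, high]; other cases are excluded."""
    groups = [[] for _ in range(low, high + 1)]
    for value, code in zip(values, codes):
        if low <= code <= high:
            groups[code - low].append(value)
    return groups


def jonckheere_terpstra(groups):
    """Return (J, mean, standard deviation, standardized J) for ordered groups."""
    j = 0.0
    for a in range(len(groups) - 1):
        for b in range(a + 1, len(groups)):
            for x in groups[a]:
                for y in groups[b]:
                    if x < y:
                        j += 1.0      # pair ordered as the alternative predicts
                    elif x == y:
                        j += 0.5      # ties counted as one half
    sizes = [len(g) for g in groups]
    n = sum(sizes)
    mean = (n * n - sum(m * m for m in sizes)) / 4.0
    # No-ties variance; a tie-corrected variance would be needed for heavily tied data.
    variance = (n * n * (2 * n + 3) - sum(m * m * (2 * m + 3) for m in sizes)) / 72.0
    sd = sqrt(variance)
    return j, mean, sd, (j - mean) / sd


# Hypothetical data: the case with group code 7 falls outside (0,2) and is excluded.
values = [1, 2, 3, 4, 5, 6, 99]
codes = [0, 0, 1, 1, 2, 2, 7]
groups = split_by_group(values, codes, 0, 2)
j, mean, sd, z = jonckheere_terpstra(groups)
print(f"J = {j}, mean = {mean}, sd = {sd:.4f}, standardized = {z:.4f}")
```

A positive standardized value indicates that responses tend to increase with the group code, which is the sense in which the sign gives the direction of a one-tailed inference.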


Conditions for Exact Tests 


There are certain conditions under which exact results are always provided, even when 
you have specified the Monte Carlo method either through the dialog box or through 
syntax. Table A.1 displays the conditions for the relevant tests under which exact 
results are always provided and a request for the Monte Carlo method is ignored. 


Table A.1 Conditions under which exact tests are always provided


Test                              Procedure                                 Condition

Binomial test                     Nonparametric tests: Binomial Tests       Exact results are always
                                                                            provided
Fisher’s exact test               Crosstabs                                 2 x 2 table
Likelihood-ratio test             Crosstabs                                 2 x 2 table
Linear-by-linear association      Crosstabs                                 2 x 2 table
test
McNemar test                      Nonparametric tests: Tests for            Exact results are always
                                  two related samples                       provided
Median test                       Nonparametric tests: Tests for            k = 2 and n < 30
                                  several related samples
Pearson chi-square test           Crosstabs                                 2 x 2 table
Sign test                         Nonparametric tests: Tests for            n ≤ 25
                                  two related samples
Wald-Wolfowitz runs test          Nonparametric tests: Tests for
                                  two independent samples


Algorithms in Exact Tests 


Exact Algorithms 


An exact p value is computed by enumerating every single outcome in some suitably
defined reference set, identifying all outcomes that are at least as extreme as the
observed one, and summing their probabilities under the null hypothesis. Although this
might appear to be a formidable computing problem once the reference set exceeds, say, a
few million outcomes, it is still feasible. Many researchers have worked on this
problem and have developed fast numerical algorithms that enumerate all of the
possible outcomes implicitly rather than explicitly. That is, these algorithms do not
examine each individual outcome separately. There are ways to identify large numbers
of outcomes at one time and classify them as either more or less extreme than the
observed outcome. A complete collection of reference files for all of these algorithms
is available in the Exact-Stats Mailbase on the Internet. These references can be
accessed through FTP, Gopher, or World Wide Web at the following addresses:


ftp://mailbase.ac.uk/pub/lists/exact-stats/files 


gopher://mailbase.ac.uk/Mailbase Lists - A-E/exact-stats/Other Files


http://www.mailbase.ac.uk/Mailbase Lists - A-E/exact-stats/Other Files
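
For the simplest case, a single 2 x 2 table, the explicit enumeration described at the start of this section can be sketched in a few lines of Python. This is a minimal illustration, not the implicit enumeration that Exact Tests performs: it assumes the reference set is all 2 x 2 tables with the observed row and column totals, that each table has its hypergeometric probability under the null hypothesis, and that a table counts as extreme when its probability does not exceed that of the observed table. The function name and the example counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided exact p value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of the table whose top-left cell equals x.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)      # smallest feasible top-left count
    highest = min(row1, col1)         # largest feasible top-left count
    p_observed = prob(a)
    # Sum the probabilities of every table that is no more likely than the observed one.
    # The small relative tolerance guards against floating-point ties.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(f"exact two-sided p value = {p_value:.4f}")
```

Here the reference set contains only five tables. For larger r x c tables the reference set grows very quickly, which is why explicit enumeration of this kind soon becomes impractical.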

One class of algorithms, called network algorithms, was developed by Mehta, Patel, and 
their colleagues at the Harvard School of Public Health. These algorithms are referenced 
below in chronological order. Many of them have already been incorporated into Exact 
Tests, and others will be incorporated into future releases of the software. 


Mehta, C. R., and N. R. Patel. 1980. A network algorithm for the exact treatment of the
2 x k contingency table. Communications in Statistics, 9:6, 649-664.

Mehta, C. R., and N. R. Patel. 1983. A network algorithm for performing Fisher’s exact test
in r x c contingency tables. Journal of the American Statistical Association, 78:382,
427-434.

Mehta, C. R., N. R. Patel, and A. Tsiatis. 1984. Exact significance testing to establish
treatment equivalence with ordered categorical data. Biometrics, 40: 819-825.

Mehta, C. R., N. R. Patel, and R. Gray. 1985. On computing an exact confidence interval for
the common odds ratio in several 2 x 2 contingency tables. Journal of the American
Statistical Association, 80:392, 969-973.

Mehta, C. R., and N. R. Patel. 1986. A hybrid algorithm for Fisher’s exact test in unordered
r x c contingency tables. Communications in Statistics, 15:2, 387-403.

Mehta, C. R., and N. R. Patel. 1986. FEXACT: A FORTRAN subroutine for Fisher’s exact
test on unordered r x c contingency tables. ACM Transactions on Mathematical
Software, 12:2, 154-161.

Hirji, K., C. R. Mehta, and N. R. Patel. 1987. Computing distributions for exact logistic
regression. Journal of the American Statistical Association, 82:400, 1110-1117.

Mehta, C. R., N. R. Patel, and L. J. Wei. 1988. Constructing exact significance tests with
restricted randomization rules. Biometrika, 75:2, 295-302.

Hirji, K., C. R. Mehta, and N. R. Patel. 1988. Exact inference for matched case control
studies. Biometrics, 44:3, 803-814.

Agresti, A., C. R. Mehta, and N. R. Patel. 1990. Exact inference for contingency tables with
ordered categories. Journal of the American Statistical Association, 85:410, 453-458.

Mehta, C. R., N. R. Patel, and P. Senchaudhuri. 1992. Exact stratified linear rank tests for
ordered categorical and binary data. Journal of Computational and Graphical Statistics,
1: 21-40.

Mehta, C. R. 1992. An interdisciplinary approach to exact inference for contingency tables.
Statistical Science, 7: 167-170.

Hilton, J., and C. R. Mehta. 1993. Power and sample size calculations for exact conditional
tests with ordered categorical data. Biometrics, 49: 609-616.

Hilton, J., C. R. Mehta, and N. R. Patel. 1994. Exact Smirnov p values using a network
algorithm. Computational Statistics and Data Analysis, 17:4, 351-361.

Mehta, C. R., N. R. Patel, P. Senchaudhuri, and A. A. Tsiatis. 1994. Exact permutational
tests for group sequential clinical trials. Biometrics, 50:4, 1042-1053.


Monte Carlo Algorithms 


Monte Carlo algorithms solve a slightly easier computational problem. They do not 
attempt to enumerate all of the members of the reference set. Instead, they estimate the 
p value by taking a random sample from the reference set. The Monte Carlo algorithms 
in Exact Tests make use of ideas in the following papers (in chronological order): 


Agresti, A., D. Wackerly, and J. M. Boyett. 1979. Exact conditional tests for
cross-classifications: Approximations of attained significance levels. Psychometrika,
44: 75-83.

Patefield, W. M. 1981. An efficient method of generating r x c tables with given row and
column totals. (Algorithm AS 159.) Applied Statistics, 30: 91-97.

Mehta, C. R., N. R. Patel, and P. Senchaudhuri. 1988. Importance sampling for estimating
exact probabilities in permutational inference. Journal of the American Statistical
Association, 83:404, 999-1005.

Senchaudhuri, P., C. R. Mehta, and N. R. Patel. 1995. Estimating exact p values by the method
of control variates, or Monte Carlo rescue. Journal of the American Statistical Association.
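
The basic sampling idea described at the start of this section can be sketched in Python as follows. This is a crude illustration under stated assumptions, not any of the algorithms in the papers listed above: tables with the observed margins are drawn by randomly permuting the column labels of the individual cases, the Pearson chi-square statistic is used to order the tables, and all row and column totals are assumed to be positive. The function names are hypothetical, and the seed only makes the run repeatable; it does not reproduce SPSS random numbers.

```python
import random


def expand(table):
    """Turn a table of counts into one (row label, column label) pair per case."""
    rows, cols = [], []
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            rows.extend([i] * count)
            cols.extend([j] * count)
    return rows, cols


def tabulate(rows, cols, n_rows, n_cols):
    """Rebuild the table of counts from per-case labels."""
    table = [[0] * n_cols for _ in range(n_rows)]
    for i, j in zip(rows, cols):
        table[i][j] += 1
    return table


def pearson_chi_square(table):
    """Pearson chi-square statistic; assumes every margin is positive."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            total += (count - expected) ** 2 / expected
    return total


def monte_carlo_p_value(table, statistic, samples=10_000, seed=2_000_000):
    """Estimate the exact p value as the share of sampled tables at least as extreme."""
    rng = random.Random(seed)
    rows, cols = expand(table)
    n_rows, n_cols = len(table), len(table[0])
    observed = statistic(table)
    hits = 0
    for _ in range(samples):
        rng.shuffle(cols)   # permuting labels keeps both sets of margins fixed
        if statistic(tabulate(rows, cols, n_rows, n_cols)) >= observed - 1e-12:
            hits += 1
    return hits / samples


estimate = monte_carlo_p_value([[3, 1], [1, 3]], pearson_chi_square)
print(f"Monte Carlo estimate of the exact p value: {estimate:.4f}")
```

For this small table the exact two-sided p value is 34/70, so the estimate can be compared directly with the value obtained by full enumeration.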




Notices 


This information was developed for products and services offered in the U.S.A. 


IBM may not offer the products, services, or features discussed in this document in other 
countries. Consult your local IBM representative for information on the products and 
services currently available in your area. Any reference to an IBM product, program, or 
service is not intended to state or imply that only that IBM product, program, or service may 
be used. Any functionally equivalent product, program, or service that does not infringe any 
IBM intellectual property right may be used instead. However, it is the user's responsibility 
to evaluate and verify the operation of any non-IBM product, program, or service. 


IBM may have patents or pending patent applications covering subject matter described in 
this document. The furnishing of this document does not grant you any license to these 
patents. You can send license inquiries, in writing, to: 


IBM Director of Licensing 
IBM Corporation 

North Castle Drive 
Armonk, NY 10504-1785 
U.S.A. 


For license inquiries regarding double-byte character set (DBCS) information, contact the 
IBM Intellectual Property Department in your country or send inquiries, in writing, to: 


Intellectual Property Licensing 
Legal and Intellectual Property Law 
IBM Japan Ltd. 

1623-14, Shimotsuruma, Yamato-shi 
Kanagawa 242-8502 Japan 
The following paragraph does not apply to the United Kingdom or any other country 
where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS 
MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT 
WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT 
NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, 
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do 
not allow disclaimer of express or implied warranties in certain transactions, therefore, this 
statement may not apply to you. 


This information could include technical inaccuracies or typographical errors. Changes are 
periodically made to the information herein; these changes will be incorporated in new 
editions of the publication. IBM may make improvements and/or changes in the product(s) 
and/or the program(s) described in this publication at any time without notice. 


Any references in this information to non-IBM Web sites are provided for convenience only 
and do not in any manner serve as an endorsement of those Web sites. The materials at those 
Web sites are not part of the materials for this IBM product and use of those Web sites is at 
your own risk. 


IBM may use or distribute any of the information you supply in any way it believes 
appropriate without incurring any obligation to you. 


Licensees of this program who wish to have information about it for the purpose of enabling: 
(i) the exchange of information between independently created programs and other programs 
(including this one) and (ii) the mutual use of the information which has been exchanged, 
should contact: 


IBM Software Group 
Attention: Licensing 
233 S. Wacker Drive 
Chicago, IL 60606 
U.S.A. 


Such information may be available, subject to appropriate terms and conditions, including in 
some cases, payment of a fee. 


The licensed program described in this document and all licensed material available for it are 
provided by IBM under terms of the IBM Customer Agreement, IBM International Program 
License Agreement or any equivalent agreement between us. 
Information concerning non-IBM products was obtained from the suppliers of those 
products, their published announcements or other publicly available sources. IBM has not 
tested those products and cannot confirm the accuracy of performance, compatibility or any 
other claims related to non-IBM products. Questions on the capabilities of non-IBM 
products should be addressed to the suppliers of those products. 


All statements regarding IBM's future direction or intent are subject to change or withdrawal 
without notice, and represent goals and objectives only. 


This information contains examples of data and reports used in daily business operations. To 
illustrate them as completely as possible, the examples include the names of individuals, 
companies, brands, and products. All of these names are fictitious and any similarity to the 
names and addresses used by an actual business enterprise is entirely coincidental. 


If you are viewing this information softcopy, the photographs and color illustrations may not 
appear. 


Trademarks 


IBM, the IBM logo, ibm.com, and SPSS are trademarks or registered trademarks of
International Business Machines Corp., registered in many jurisdictions worldwide. Other
product and service names might be trademarks of IBM or other companies. A current list of
IBM trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml.


Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft 
Corporation in the United States, other countries, or both. 


© Copyright IBM Corporation. 1989, 2013


Index 


asymptotic method, 1 
asymptotic one-sided p value 


K independent samples, 122, 129, 131 


asymptotic one-sided p value 
Jonckheere-Terpstra test, 159 
Mann-Whitney test, 84 
asymptotic p value, 12 
assumptions, 12 
defined, 16 
measures of association, 169 
obtaining, 8 
Pearson’s chi-square, 16 
when to use, 16, 29-37 
asymptotic two-sided p value 
K independent samples, 122 
asymptotic two-sided p value 
Jonckheere-Terpstra test, 159 
K related samples, 101 
Mann-Whitney test, 84 
McNemar test, 69 
rx tables, 140 
sign test, 62 
Wilcoxon signed-ranks, 62 


binary data 
one-sample test, 49-55 


binomial test, 49-50 


example: pilot study for new drug, 50 


bivariate data 


measures of association, 166—167 


blocked comparisons, 95 


BY (keyword) 
NPAR TESTS command, 202 


categorical data 
assumptions, 12 

categorical variables, 135 

CIN (keyword) 
CROSSTABS command, 199 
NPAR TESTS command, 200 


class variables, 135 
Cochran’s Q test, 108-111 
example:cross-over clinical trial, 109-111 
when to use, 96 
Cohen’s kappa. See Kappa 
confidence levels 
specifying, 8 
contingency coefficients 
measures of association, 185, 185-188 
contingency tables. See r x c contingency tables 


continuous data 
assumptions, 12 


continuous variables, 135 


correlations 
Pearson’s product-moment correlation coefficient, 
172-174 
Spearman’s rank-order correlation coefficient, 
174-176 
Cramer’s V 
example, 187-188
measures of association, 185-188 


CROSSTABS (command), 199-??
new syntax, 199 


Crosstabs procedure, 199 
asymptotic p value, 8 
confidence levels, 8 
contingency coefficients, 185 
exact p value, 9
exact statistics, 7-9 
Fisher’s exact test, 141 
gamma, 171 
Goodman and Kruskal’s tau, 185 
Kendall’s tau-b, 171
Kendall’s tau-c, 171 
likelihood-ratio test, 141 
linear-by-linear association test, 155 
Monte Carlo p value, 8 
Pearson chi-square test, 141 
Pearson’s product moment correlation coefficient, 
171 
samples, 8 
Somers’ d, 171
Spearman’s rank-order correlation coefficient, 171
time limit, 9 
uncertainty coefficient, 185 
crosstabulated data 
measures of association, 165-167
crosstabulation, 199 
See also Crosstabs procedure 


data sets 
small, 30 
sparse, 36-37 
tied, 31-34 
unbalanced, 35 
doubly ordered contingency tables, 135 


doubly ordered contingency tables. See also r x c
contingency tables 


EXACT (keyword) 
CROSSTABS command, 199 
NPAR TESTS command, 200 


exact method, 1-3 
exact one-sided p value
K independent samples, 134
Jonckheere-Terpstra test, 159 
linear-by-linear association test, 162 
Mann-Whitney test, 82 
McNemar test, 69 
runs test, 92 
exact p value, 12, 16 
defined, 1 
example: fire fighter data, 1-3
obtaining, 9
r x c tables, 136
when to use, 24 
exact statistics 
obtaining, 7-9 
exact tests 
memory limits, 9 
setting time limit, 9 
when to use, 5 
exact two-sided p value
K independent samples, 134
median test, 124
Jonckheere-Terpstra test, 160 


K related samples, 99 
Kolmogorov-Smirnov, 88 
linear-by-linear association test, 162
Mann-Whitney test, 82 

McNemar test, 69 

measures of agreement, 168 
nominal data, 168 

ordinal data, 168 

r x c tables, 138

runs test, 52 


Fisher’s exact test, 147-148 
example: 2 x 2 table, 18-24 
example: tea-tasting experiment, 18-24 
when to use, 141 

Friedman’s test, 101-104 
example: effect of hypnosis, 102-104 
when to use, 96 


full multinomial sampling, 137 


gamma, 171 
example: smoking habit data, 183-184 
measures of association, 183-184 
Goodman and Kruskal’s tau 
example: party preference data, 189-191 
measures of association, 185, 188-191 


independent samples, 75-94
Jonckheere-Terpstra test, 114, 131-134 
when to use each test, 76 


Jonckheere-Terpstra test
example: space shuttle O-ring incidents, 132-134 


Jonckheere-Terpstra test 
asymptotic one-sided p value, 159 
asymptotic two-sided p value, 159 
exact one-sided p value, 159 
exact two-sided p value, 160 
example: dose-response data, 157-160 
in Tests for Several Independent Samples 
procedure, 202 
r x c contingency tables, 156-160
when to use, 115, 156 


J-T (subcommand) 
NPAR TESTS command, 202 


K independent samples tests, 113-134 
Jonckheere-Terpstra test, 131-134 
Kruskal-Wallis test, 127-130 
median test, 122-127
when to use, 114-115 


K related samples tests, 95-111 
Cochran’s Q, 108-111 
Friedman’s, 101-104 
Kendall’s W, 104-107 
when to use, 96 
kappa 
example: student teacher ratings, 193-195
measures of agreement, 193-195 


Kendall’s coefficient of concordance. See Kendall’s W test


Kendall’s tau 
example: smoking habit data, 180-182 
measures of association, 177-182

Kendall’s tau-b, 171 

Kendall’s tau-c, 171 

Kendall’s W test, 104-107 
example: attendance at annual meeting, 105-107 
example: relationship to Spearman’s R, 107 
when to use, 96 

Kolmogorov-Smirnov test, 87-91
example: effectiveness of vitamin C, 90-91
example: diastolic blood pressure data, 31-34
when to use, 76 

Kruskal-Wallis test, 149-153 
example: hematologic toxicity data, 129-130 
example: tumor regression rates, 150-153 
when to use, 115, 143, 149 


likelihood ratio test 
example: sports activity data, 25-27
likelihood-ratio test, 145-147 
when to use, 141 
linear-by-linear association test 
exact one-sided p value, 162 
exact two-sided p value, 162 
example: dose-response data, 161 
example: alcohol and birth defect data, 35
r x c contingency tables, 161-164
when to use, 156 


location-shift alternatives, 115
Mann-Whitney test, 80-86
example: blood pressure data, 84-86 
when to use, 76 
Mantel-Haenszel test. See linear-by-linear association 
test 
marginal homogeneity test, 71-73 
example: matched-case control study, 71-72 
example: Pap-smear classification, 72-73 
in Two-Related-Samples Tests procedure, 
201-202 
when to use, 58 
MC (keyword) 
CROSSTABS command, 199 
NPAR TESTS command, 200 


McNemar test, 68-70
exact one-sided p value, 69 
exact two-sided p value, 69 
example: voters’ preference, 70 
when to use, 58 


measures of agreement 
exact two-sided p value, 168 
kappa, 193-195 
measures of association 

asymptotic p values, 169 

bivariate data, 166-167 

contingency coefficients, 185, 185-188 

Cramer’s V, 185-188 

crosstabulated data, 165-167 

exact p values, 168-169 

gamma, 183-184 

Goodman and Kruskal’s tau, 188-191 

introduction, 165-170

Kendall’s tau, 177-182 

Kendall’s W, 171 

Monte Carlo p values, 169 

nominal data, 185-191 

ordinal data, 171-184 

p values, 168-170 

Pearson’s product-moment correlation coefficient, 
171, 172-174 

phi, 185-188 

point estimates, 168 

proportional reduction in prediction error, 188-191 

proportional reduction in predictive error, 185 

Somers’ d, 177-182 

Spearman’s rank-order correlation coefficient, 
171, 174-176 

uncertainty coefficient, 189-191
median test, 122-127
example: hematologic toxicity data, 125-127 
when to use, 115 
memory limits 
exact tests, 9 
METHOD (subcommand) 
CROSSTABS command, 199 
NPAR TESTS command, 200-201, 202 
MH (subcommand) 
NPAR TESTS command, 201-202
Monte Carlo method, 3-4 
defined, 3 
example: fire fighter data, 4
random number seed, 9-10 
Monte Carlo one-sided p value 
sign test, 63 
Wilcoxon signed-ranks test, 63 
Monte Carlo p value 
obtaining, 8 
when to use, 24-29 
Monte Carlo p values 
measures of association, 169 
Monte Carlo two-sided p value
K independent samples, 120
median test, 124
K related samples, 100 
Kolmogorov-Smirnov, 88 
Mann-Whitney test, 83 
r x c tables, 139
sign test, 64 
Wilcoxon signed-ranks test, 64 


nominal data 
contingency coefficients, 185-188 
Cramer’s V, 185-188 
exact two-sided p values, 168 
Goodman and Kruskal’s tau, 188-191 
phi, 185-188 
proportional reduction in prediction error, 188-191 
uncertainty coefficient, 189-191 


nominal variables, 135 


nonparametric tests 
assumptions, 12 
asymptotic p value, 8 
binomial, 49 
Cochran’s Q, 95 


confidence levels, 8 

exact p value, 9

exact statistics, 7-9 

Friedman’s, 95 

Jonckheere-Terpstra test, 114, 155 

Kendall’s W, 95 

Kolmogorov-Smirnov, 75 

Kruskal-Wallis, 114, 149 

Mann-Whitney test, 75 

marginal homogeneity, 57 

McNemar, 57 

median test, 114 

Monte Carlo p value, 8 

new syntax, 200 

new tests, 9 

runs, 49, 75 

samples, 8 

sign, 57 

time limit, 9 

two-related samples, 57 

Wald-Wolfowitz runs test, 75 

Wilcoxon signed-ranks, 57 
NPAR TESTS (command), 200-202 

J-T subcommand, 202 

METHOD subcommand, 200-201 

MH subcommand, 201-202 

new syntax, 200 

pairing variables, 201 


observed r x c tables, 135-136 
computing exact p value for, 136 
one-sample tests 
binary data, 49-55 
runs test, 51-55 
one-sided p value
K independent samples, 120, 122
binomial test, 50 
Mann-Whitney test, 82, 84 
McNemar test, 69 
runs test, 92 
sign test, 62, 63 
Wilcoxon signed-ranks test, 62, 63 
ordered alternatives, 115 
ordered variables, 135 


ordinal data 
exact two-sided p values, 168 
gamma, 183-184 


Kendall’s tau, 177-182 

measures of association, 171-184 

Pearson’s product-moment correlation coefficient, 
172-174 

Somers’ d, 177-182 

Spearman’s rank-order correlation coefficient, 
174-176 


p value 

choosing a method, 22-37 

hypothesis testing, 11-14 

in two-sample tests, 80 

measures of association, 168-170 
p value. See also one-sided p value 
p value. See also two-sided p value


PAIRED (keyword) 
NPAR TESTS command, 201 
paired samples, 57-73 
when to use each test, 58 
Pearson chi-square 
example: 3 x 4 table, 14-18 
example: fire fighter data, 14-18
example: sparse contingency table, 12-14 
example: sports activity data, 36-37 
Pearson chi-square test, 138, 144-145 
when to use, 141 
Pearson’s product-moment correlation coefficient 
example: social striving data, 30, 172-174
measures of association, 172-174
phi
example, 187-188
measures of association, 185-188 
point estimates 
measures of association, 168
Poisson sampling, 137 
product multinomial sampling, 137, 143 
proportional reduction in prediction error 
measures of association, 185, 188-191 


proportional reduction in prediction error. See also 
Goodman and Kruskal’s tau


proportional reduction in prediction error. See also 
uncertainty coefficient 


r x c contingency tables 
doubly ordered, 155-164 
example: oral lesions data, 143-144 
Jonckheere-Terpstra test, 156-160
Kruskal-Wallis test, 149-153
linear-by-linear association test, 161-164 
observed, 135-136 

reference sets for, 136 

singly ordered, 149-153 

tests on, 135-140 

unordered, 141-148 


random number seed, 9-10 


reference sets, 16-17, 21, 137 
for r x c tables, 136
runs test, 51-55, 91-94 
example: children’s aggression scores, 53-54 
example: discrimination against female workers, 
92-94 
example: small data set, 54-55 
when to use, 76 


samples 
Monte Carlo method, 8 
SAMPLES (keyword) 
NPAR TESTS command, 200 
sampling 
full multinomial, 137 
Poisson, 137 
product multinomial, 137 
sign test, 59-67 
when to use, 58 
singly ordered contingency tables, 135 
singly ordered contingency tables. See also r x c
contingency tables 
Somers’ d, 171, 177-182 
example: smoking habit data, 180-182 
measures of association, 177-182
Spearman’s rank-order correlation coefficient 
example: social striving data, 175-176 
measures of association, 174-176 


test statistics 
defining for r x c tables, 138 
Tests for Several Independent Samples procedure, 
200-202 
grouping variables, 202 
time limit 
setting for exact tests, 9 
TIMER (keyword) 
NPAR TESTS command, 200 


Two-Related-Samples Tests procedure, 201-202
two-sample tests
independent samples, 75-94 
Kolmogorov-Smirnov, 87-91
Mann-Whitney, 80-86
marginal homogeneity, 71-73
McNemar, 68-70
median, 94
paired samples, 57-73
runs, 91-94
sign, 59-67 
Wilcoxon signed-ranks, 59-67 
two-sided p value
K independent samples, 115, 120, 121
median test, 124
binomial test, 50 
K related samples, 99, 101 
Kolmogorov-Smirnov, 88
Mann-Whitney test, 82, 84 
McNemar test, 69 
r x c tables, 138, 140
runs test, 52 
sign test, 62, 64 
Wilcoxon signed-ranks test, 62, 64 


uncertainty coefficient 

example: party preference data, 189-191 

measures of association, 185, 189-191 
unordered continuous contingency tables, 135
unordered r x c contingency tables 

See also r x c contingency tables 


Wald-Wolfowitz. See runs test 
Wilcoxon rank-sum test, 11 
Wilcoxon signed-rank test, 11 
Wilcoxon signed-ranks test, 59-67 
example: AZT for AIDS, 64-67 
mid-ranks, 60 
permutational distribution, 60 
when to use, 58 
Wilcoxon-Mann-Whitney test. See Mann-Whitney
test 
WITH (keyword) 
NPAR TESTS command, 201 


